Published as a conference paper at ICLR 2024

NEURAL ACTIVE LEARNING BEYOND BANDITS

Yikun Ban1, Ishika Agarwal1, Ziwei Wu1, Yada Zhu2, Kommy Weldemariam2, Hanghang Tong1, Jingrui He1
1University of Illinois Urbana-Champaign, 2IBM Research
1{yikunb2, ishikaa2, ziweiwu2, htong, jingrui}@illinois.edu
2{yzhu, kommy}@us.ibm.com

We study both stream-based and pool-based active learning with neural network approximations. A recent line of works proposed bandit-based approaches that transformed active learning into a bandit problem, achieving both theoretical and empirical success. However, the performance and computational costs of these methods may be susceptible to the number of classes, denoted as $K$, due to this transformation. Therefore, this paper seeks to answer the question: "How can we mitigate the adverse impacts of $K$ while retaining the advantages of principled exploration and provable performance guarantees in active learning?" To tackle this challenge, we propose two algorithms based on newly designed exploitation and exploration neural networks for stream-based and pool-based active learning. Subsequently, we provide theoretical performance guarantees for both algorithms in a non-parametric setting, demonstrating a slower error-growth rate concerning $K$ for the proposed approaches. We use extensive experiments to evaluate the proposed algorithms, which consistently outperform state-of-the-art baselines.

1 INTRODUCTION

Active learning is one of the primary areas in machine learning that investigates how to learn from a small subset of labeled data while attaining generalization performance comparable to passive learning [19]. There are two main settings of active learning: stream-based and pool-based. In the stream-based setting, the learner is presented with an instance drawn from some distribution in each round and is required to decide on the fly whether or not to query its label from the oracle. In the pool-based setting, the learner selects one or multiple instances from an unlabeled pool and hands them over to the oracle for labeling, repeating this process until the label budget is exhausted [51]. The essence of active learning is to exploit the knowledge extracted from labeled instances and to explore the unlabeled data, maximizing information acquisition for long-term benefits.

Using neural networks (NNs) to perform active learning has been explored extensively in recent works [48; 52; 56; 6]. However, these methods often lack a provable performance guarantee despite strong empirical performance. To address this issue, a recent line of works [58; 14] proposed bandit-based approaches to solve the active learning problem, equipped with principled exploration and theoretical performance guarantees. In contextual bandits [38; 68], the learner is presented with $K$ arms (context vectors) and required to select one arm in each round; the associated reward is then observed. [58; 14] transformed online $K$-class classification into a bandit problem. Specifically, in one round of stream-based active learning, a data instance $\mathbf{x}_t \in \mathbb{R}^d$ is transformed into $K$ long vectors corresponding to $K$ arms, matching the $K$ classes:
$$\mathbf{x}_{t,1} = [\mathbf{x}_t^\top, \mathbf{0}^\top, \dots, \mathbf{0}^\top]^\top, \quad \dots, \quad \mathbf{x}_{t,K} = [\mathbf{0}^\top, \dots, \mathbf{0}^\top, \mathbf{x}_t^\top]^\top,$$
where $\mathbf{x}_{t,k} \in \mathbb{R}^{dK}$, $k \in [K]$. Then, the learner uses an NN model to calculate a score for each arm and selects an arm based on these scores. The index of the selected arm represents the index of the predicted class.
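To make this transformation concrete, the following minimal sketch (our own illustration; the function name and shapes are assumptions, not code from [58; 14]) builds the $K$ arm vectors for a single instance:

```python
import numpy as np

def to_arms(x: np.ndarray, K: int) -> np.ndarray:
    """Transform an instance x in R^d into K arm vectors in R^{Kd}:
    arm k holds x in its k-th d-dimensional block and zeros elsewhere."""
    d = x.shape[0]
    arms = np.zeros((K, K * d))
    for k in range(K):
        arms[k, k * d:(k + 1) * d] = x
    return arms

# Example: with d = 3 and K = 4, each arm is 12-dimensional, so a bandit-based
# learner must run K = 4 separate forward passes to score one instance.
arms = to_arms(np.array([0.1, 0.5, 0.4]), K=4)
assert arms.shape == (4, 12)
```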
This design enables researchers to utilize the exploration strategies and analyses developed for contextual bandits to solve the active learning problem. Note that [58; 14] can only handle the stream-based setting of active learning. However, bandit-based approaches bear the following two limitations. First, as the instance $\mathbf{x}_t$ is transformed into $K$ arms, a score must be calculated for each of the $K$ arms separately, incurring $K$ forward propagations of the neural network per round; this computational cost scales with $K$. Second, the transformed long vector (arm) has $Kd$ dimensions, in contrast to the $d$ dimensions of the original instance, as the input of the NN model. This potentially amplifies the effects of $K$ on an active learning algorithm's performance. We empirically evaluate [58; 14] as shown in Table 1. The results indicate a noticeable degradation in both test accuracy and running time as $K$ increases.

Table 1: Test accuracy and running time compared to bandit-based methods in the stream-based setting.

| | Adult | MT | Letter | Covertype | Shuttle | Fashion |
|---|---|---|---|---|---|---|
| Number of classes | 2 | 2 | 2 | 7 | 7 | 10 |
| Accuracy: NeurAL-NTK [58] | 80.3 ± 0.12 | 76.9 ± 0.15 | 79.3 ± 0.21 | 61.9 ± 0.08 | 95.3 ± 0.20 | 64.5 ± 0.16 |
| Accuracy: I-NeurAL [14] | 84.2 ± 0.22 | 79.4 ± 0.16 | 82.9 ± 0.06 | 65.2 ± 0.19 | 99.3 ± 0.12 | 73.5 ± 0.28 |
| Accuracy: NEURONAL-S | 84.8 ± 0.51 | 83.7 ± 0.17 | 86.5 ± 0.16 | 74.4 ± 0.19 | 99.5 ± 0.09 | 83.2 ± 0.38 |
| Running time: NeurAL-NTK [58] | 163.2 ± 1.31 | 259.4 ± 2.48 | 134.0 ± 3.44 | 461.2 ± 1.26 | 384.7 ± 1.86 | 1819.4 ± 10.84 |
| Running time: I-NeurAL [14] | 102.4 ± 7.53 | 46.2 ± 5.58 | 232.2 ± 3.80 | 1051.7 ± 5.85 | 503.1 ± 9.66 | 1712.7 ± 12.8 |
| Running time: NEURONAL-S | 54.7 ± 3.21 | 10.5 ± 0.39 | 92.1 ± 3.4 | 166.4 ± 0.59 | 101.2 ± 2.32 | 116.3 ± 3.39 |

In response, in this paper, we aim to mitigate the adverse effects of $K$ on bandit-based approaches in active learning. Our methods are built upon and go beyond [14]. [14] adopted the idea of [13] to employ two neural networks, one for exploitation and another for exploration. As previously mentioned, these two neural networks take the transformed $Kd$-dimensional arm as input. Moreover, in each round, [14] decomposed the label vector $\mathbf{y}_t \in \{0,1\}^K$ into $K$ rewards (scalars), necessitating training the two neural networks $K$ times, once for each arm. Next, we summarize our key ideas and contributions, which reduce the input dimension back to $d$ and the number of forward propagations to one per round while preserving the essence of neural-network exploitation and exploration.

Methodology. (1) We extend the loss function in active learning from the 0-1 loss to any bounded loss, which is more flexible and general. In contrast, [58; 14] restricted the loss to the 0-1 loss, because their bandit-based methodology forced them to define a reward for each class (arm). (2) We redesign the input and output of the exploitation and exploration neural networks so that they directly take the $d$-dimensional instance as input and output the predicted probabilities for all $K$ classes simultaneously, mitigating the curse of $K$. The connection between the exploitation and exploration neural networks is also reconstructed beyond the standard bandit setting; in other words, we avoid transforming active learning into a standard bandit problem. This is the first main contribution of this paper. (3) To facilitate efficient and effective exploration, we introduce the end-to-end embedding (Definition 4.1) as the input of the exploration neural network, which removes the dependence on the input dimension while preserving the essential information.
(4) In addition to our proposed stream-based algorithm, referred to as NEURONAL-S, we also propose a pool-based active learning algorithm, NEURONAL-P. We bring the redesigned exploitation and exploration networks into the pool-based setting and propose a novel gap-inverse-based selection strategy tailored for pool-based active learning. This is our second main contribution. Note that stream-based algorithms cannot be directly converted into the pool-based setting, as discussed in Appendix B.

Theoretical analysis. We provide regret upper bounds for the proposed stream-based algorithm under low-noise conditions on the data distribution. Our results indicate that the cumulative regret of NEURONAL-S grows slower than that of [58] concerning $K$ by a multiplicative factor of at least $O(\sqrt{T\log(1+\lambda_0)})$ and up to $\widetilde{O}(\sqrt{md})$, where $\lambda_0$ is the smallest eigenvalue of the Neural Tangent Kernel (NTK) matrix and $m$ is the width of the neural network. This finding helps explain why our algorithms outperform the bandit-based algorithms, particularly when $K$ is large, as shown in Table 1. In the binary classification task, our regret bounds directly remove the dependence on the effective dimension $\tilde{d}$, which measures the actual underlying dimension of the RKHS spanned by the NTK, as discussed in Sec. 5. We also provide a performance analysis for the proposed pool-based algorithm in the non-parametric setting, tailored to neural network models. In contrast, previous works focus on regimes with either a parametric setting that requires a finite VC dimension [31] or a linear mapping function assumption [8; 64; 28]. The above theoretical results are our third main contribution.

Empirical evaluation. Finally, we perform extensive experiments to evaluate the proposed algorithms in both the stream-based and pool-based settings against state-of-the-art baselines. Our evaluation encompasses various metrics, including test accuracy and running time, and we carry out ablation studies to investigate the impact of hyper-parameters and label budgets. This is our fourth main contribution.

2 RELATED WORK

Active learning has been studied for decades and adapted to many real-world applications [53]. There are several strategies in active learning to query labels judiciously, such as diversity sampling [23; 67] and uncertainty sampling [69; 62; 41]. Neural networks have been widely used in various scenarios [40; 39; 26; 27; 59; 60]. Contextual bandits are a powerful tool for decision-making [13; 10; 9; 11; 49; 50] and have been used for active learning. Works that use neural networks to perform active learning can be classified, according to the model structure, into online-learning-based [45; 6] and bandit-based [58; 14] methods. For most of these methods, the neural model takes the instance as input and calculates the predicted scores for all classes simultaneously. For example, [45; 6; 17; 35; 54; 56; 66; 5] exploit neural networks for active learning to improve empirical performance. As mentioned before, this type of related work often lacks principled exploration and a provable guarantee in the non-parametric setting. The primary focus of theoretical research in active learning is to determine the convergence rate of the population loss (regret) of the hypothesis produced by the algorithm in relation to the number of queried labels $N$.
In the parametric setting, the convergence rates of the regret are of the form $\nu N^{-1/2} + e^{-\sqrt{N}}$, where $\nu$ is the population loss of the best function in a class with finite VC dimension, e.g., [29; 20; 47; 63; 7; 65]. Under the realizable assumption (i.e., when the Bayes-optimal classifier lies in the function class), minimax active learning rates of the form $N^{-\frac{\alpha+1}{\alpha+2}}$ were shown in [30; 36] to hold for adaptive algorithms that do not know the noise exponent $\alpha$ (defined in Sec. 5) beforehand. In non-parametric settings, [44; 42; 70] worked under smoothness (Hölder continuity/smoothness) assumptions, implying more restrictive assumptions on the marginal data distribution. [42; 70] achieved the minimax active learning rate $N^{-\frac{\beta(\alpha+1)}{2\beta+d}}$ for $\beta$-Hölder classes, where the exponent $\beta$ plays the role of the complexity of the class of functions to learn. [8; 64; 28] focused on the pool-based setting and achieve the $(\frac{d}{N})^{\frac{\alpha+1}{\alpha+2}}$ minimax rate, but their analyses are built on a linear mapping function. [37] investigated active learning with nearest-neighbor classifiers and provided a data-dependent minimax rate based on the noisy-margin properties of the random sample. [32] introduced an expected-variance criterion with Gaussian processes for neural active learning. [58] and [14] represent the closest works related to the analysis of neural networks in active learning. Both [58] and our work utilize the theory of the NTK, making the results directly comparable. [14] established a connection between their regret upper bound and the training error of a neural network function class. However, they did not provide a clear trade-off between the training error and the function class radius, making it challenging to compare [14] with our theoretical results.

3 PROBLEM DEFINITION

In this paper, we consider the $K$-class classification problem for both stream-based and pool-based active learning. Let $\mathcal{X}$ denote the input space over $\mathbb{R}^d$ and $\mathcal{Y}$ represent the label space over $\{0,1\}^K$. $\mathcal{D}$ is some unknown distribution over $\mathcal{X} \times \mathcal{Y}$. For any round $t \in [T] = \{1, 2, \dots, T\}$, an instance $\mathbf{x}_t$ is drawn from the marginal distribution $\mathcal{D}_{\mathcal{X}}$. Accordingly, the associated label $\mathbf{y}_t$ is drawn from the conditional distribution $\mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}$, in which the dimension with value 1 indicates the ground-truth class of $\mathbf{x}_t$. For clarity of presentation, we use $\mathbf{y}_{t,k}$, $k \in [K]$, to represent the $K$ possible predictions, i.e., $\mathbf{y}_{t,1} = [1, 0, \dots, 0]^\top, \dots, \mathbf{y}_{t,K} = [0, 0, \dots, 1]^\top$. Given an instance $\mathbf{x}_t \sim \mathcal{D}_{\mathcal{X}}$, the learner is required to make a prediction $\mathbf{y}_{t,\widehat{k}}$, $\widehat{k} \in [K]$, based on some function. Then, the label $\mathbf{y}_t$ is observed. A loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to [0,1]$, represented by $\ell(\mathbf{y}_{t,\widehat{k}}, \mathbf{y}_t)$, reflects the quality of this prediction. We investigate the non-parametric setting of active learning. Specifically, we make the following assumption for the conditional distribution of the loss.

Assumption 3.1. The conditional distribution of the loss given $\mathbf{x}_t$ is defined by some unknown function $\mathbf{h}: \mathcal{X} \to [0,1]^K$, such that
$$\forall k \in [K], \quad \mathbb{E}_{\mathbf{y}_t \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}}\big[\ell(\mathbf{y}_{t,k}, \mathbf{y}_t) \,\big|\, \mathbf{x}_t\big] = \mathbf{h}(\mathbf{x}_t)[k], \quad (3.1)$$
where $\mathbf{h}(\mathbf{x}_t)[k]$ represents the value of the $k$th dimension of $\mathbf{h}(\mathbf{x}_t)$.

Assumption 3.1 is the standard formulation in [58; 14]. Related works [14; 58] restricted $\ell$ to be the 0-1 loss; in contrast, we only assume $\ell$ is bounded, for the sake of generality.

Stream-based. For stream-based active learning, at each round $t \in [T]$, the learner receives an instance $\mathbf{x}_t \sim \mathcal{D}_{\mathcal{X}}$.
Then, the learner is compelled to make a prediction $\mathbf{y}_{t,\widehat{k}}$ and, at the same time, decides on the fly whether or not to observe the label $\mathbf{y}_t$. The goal of active learning is then to minimize the Population Cumulative Regret. Given the data distribution $\mathcal{D}$ and the number of rounds $T$, the Population Cumulative Regret is defined as:
$$\mathbf{R}_{stream}(T) := \sum_{t=1}^{T}\Big[\mathbb{E}_{(\mathbf{x}_t,\mathbf{y}_t)\sim\mathcal{D}}\big[\ell(\mathbf{y}_{t,\widehat{k}}, \mathbf{y}_t)\big] - \mathbb{E}_{(\mathbf{x}_t,\mathbf{y}_t)\sim\mathcal{D}}\big[\ell(\mathbf{y}_{t,k^\ast}, \mathbf{y}_t)\big]\Big], \quad (3.2)$$
where $\mathbf{y}_{t,k^\ast}$ is the prediction induced by the Bayes-optimal classifier with respect to $\mathbf{x}_t$ in round $t$, i.e., $k^\ast = \arg\min_{k \in [K]} \mathbf{h}(\mathbf{x}_t)[k]$. The Population Cumulative Regret reflects the generalization performance of a model over $T$ rounds of stream-based active learning. At the same time, the goal of the learner is to minimize the expected label query cost: $\mathbf{N}(T) := \sum_{t=1}^{T} \mathbb{E}_{\mathbf{x}_t \sim \mathcal{D}_{\mathcal{X}}}[\mathbf{I}_t \,|\, \mathbf{x}_t]$, where $\mathbf{I}_t$ is an indicator of the query decision in round $t$ such that $\mathbf{I}_t = 1$ if $\mathbf{y}_t$ is observed and $\mathbf{I}_t = 0$ otherwise.

Pool-based. The goal of pool-based active learning is to maximize the performance of a model with a limited label budget, given a pool of instances. In each round, assume there is an instance pool $\mathbf{P}_t = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_B\}$ with $B$ instances drawn from $\mathcal{D}_{\mathcal{X}}$. Then, for any $\mathbf{x}_t \in \mathbf{P}_t$, the learner is able to request a label from the conditional distribution $\mathbf{y}_t \sim \mathcal{D}_{\mathcal{Y}|\mathbf{x}_t}$ at one unit of budget cost. The total number of queries is set as $Q$. Let $\mathbf{y}_{t,\widehat{k}}$ be the prediction for $\mathbf{x}_t \in \mathbf{P}_t$ by some hypothesis. Then, the Population Cumulative Regret for pool-based active learning is defined as:
$$\mathbf{R}_{pool}(Q) := \sum_{t=1}^{Q}\Big[\mathbb{E}_{(\mathbf{x}_t,\mathbf{y}_t)\sim\mathcal{D}}\big[\ell(\mathbf{y}_{t,\widehat{k}}, \mathbf{y}_t)\big] - \mathbb{E}_{(\mathbf{x}_t,\mathbf{y}_t)\sim\mathcal{D}}\big[\ell(\mathbf{y}_{t,k^\ast}, \mathbf{y}_t)\big]\Big], \quad (3.3)$$
where $k^\ast = \arg\min_{k \in [K]} \mathbf{h}(\mathbf{x}_t)[k]$. $\mathbf{R}_{pool}(Q)$ reflects the generalization performance of a model over $Q$ rounds of pool-based active learning.

4 PROPOSED ALGORITHMS

In this section, we present the proposed algorithms for both stream-based and pool-based active learning. The proposed NN models incorporate two neural networks, one for exploitation and one for exploration. The exploitation network directly takes the instance as input and outputs the predicted probabilities for all $K$ classes simultaneously, instead of taking the $K$ transformed long vectors as input and computing the probabilities sequentially. The exploration network has a novel embedding as input that incorporates the information of all $K$ classes in the exploitation network at once, instead of calculating $K$ embeddings for $K$ arms sequentially as in bandit-based approaches.

Exploitation Network $f_1$. The exploitation network $f_1$ is a neural network which learns the mapping from the input space $\mathcal{X}$ to the loss space $[0,1]^K$. In round $t \in [T]$, we denote the network by $f_1(\cdot\,; \boldsymbol{\theta}^1_t)$, where the superscript of $\boldsymbol{\theta}^1_t$ indicates the network and the subscript indicates the index of the round after updates for inference. $f_1$ is defined as a fully-connected neural network with depth $L$ and width $m$:
$$f_1(\mathbf{x}_t; \boldsymbol{\theta}^1) := \sqrt{m}\,\mathbf{W}^1_L\sigma(\mathbf{W}^1_{L-1}\cdots\sigma(\mathbf{W}^1_1\mathbf{x}_t)) \in \mathbb{R}^K, \quad (4.1)$$
where $\mathbf{W}^1_1 \in \mathbb{R}^{m\times d}$, $\mathbf{W}^1_l \in \mathbb{R}^{m\times m}$ for $2 \le l \le L-1$, $\mathbf{W}^1_L \in \mathbb{R}^{K\times m}$, $\boldsymbol{\theta}^1 = [\mathrm{vec}(\mathbf{W}^1_1)^\top, \dots, \mathrm{vec}(\mathbf{W}^1_L)^\top]^\top \in \mathbb{R}^{p_1}$, and $\sigma$ is the ReLU activation function $\sigma(x) = \max\{0, x\}$. We randomly initialize $\boldsymbol{\theta}^1$, denoted by $\boldsymbol{\theta}^1_1$, where each entry of $\mathbf{W}^1_l$, $l \in [L-1]$, is drawn from the normal distribution $N(0, 2/m)$, and each entry of $\mathbf{W}^1_L$ is drawn from $N(0, 1/(Km))$. Note that we take the basic fully-connected network as an example for the sake of the analysis of over-parameterized networks, and $f_1$ can easily be replaced with other advanced models depending on the task. Given an instance $\mathbf{x}_t$, $f_1(\mathbf{x}_t; \boldsymbol{\theta}^1_t)$ can be considered as the estimation of $\ell(\mathbf{y}_{t,k}, \mathbf{y}_t)$, $k \in [K]$.
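As a concrete illustration, here is a minimal PyTorch sketch of $f_1$ following Eq. (4.1) and the initialization above (the class name and default sizes are our own choices; this is an assumption-laden sketch, not the authors' released code):

```python
import math
import torch
import torch.nn as nn

class ExploitationNet(nn.Module):
    """Fully-connected ReLU network f1: R^d -> R^K, scaled by sqrt(m) as in Eq. (4.1)."""
    def __init__(self, d: int, K: int, m: int = 100, L: int = 2):
        super().__init__()
        in_dims = [d] + [m] * (L - 2)                    # W_1: m x d; W_l: m x m for 2 <= l <= L-1
        self.hidden = nn.ModuleList(nn.Linear(nin, m, bias=False) for nin in in_dims)
        self.last = nn.Linear(m, K, bias=False)          # W_L: K x m
        self.scale = math.sqrt(m)
        for layer in self.hidden:                        # entries ~ N(0, 2/m)
            nn.init.normal_(layer.weight, std=math.sqrt(2.0 / m))
        nn.init.normal_(self.last.weight, std=math.sqrt(1.0 / (K * m)))  # entries ~ N(0, 1/(Km))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.hidden:
            x = torch.relu(layer(x))
        return self.scale * self.last(x)                 # estimates l(y_{t,k}, y_t) for all k at once

f1 = ExploitationNet(d=10, K=3)
scores = f1(torch.randn(1, 10))                          # shape (1, 3): one pass for all K classes
```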
In round $t$, after receiving the label $\mathbf{y}_t$, we conduct stochastic gradient descent to update $\boldsymbol{\theta}^1$ based on the loss function $\mathcal{L}_{t,1}(\boldsymbol{\theta}^1_t)$, such as $\mathcal{L}_{t,1}(\boldsymbol{\theta}^1_t) = \sum_{k\in[K]}\big(f_1(\mathbf{x}_t;\boldsymbol{\theta}^1_t)[k] - \ell(\mathbf{y}_{t,k},\mathbf{y}_t)\big)^2/2$, where $f_1(\mathbf{x}_t;\boldsymbol{\theta}^1_t)[k]$ represents the value of the $k$th dimension of $f_1(\mathbf{x}_t;\boldsymbol{\theta}^1_t)$.

Algorithm 1 NEURONAL-S
Input: $T$, $K$, $f_1$, $f_2$, $\eta_1, \eta_2$ (learning rates), $\gamma$ (exploration parameter), $\delta$ (confidence parameter), $S$ (norm parameter)
1: Initialize $\boldsymbol{\theta}^1_1, \boldsymbol{\theta}^2_1$; $\widehat{\boldsymbol{\theta}}^1_1 = \boldsymbol{\theta}^1_1$; $\widehat{\boldsymbol{\theta}}^2_1 = \boldsymbol{\theta}^2_1$; $\Omega_1 = \{(\widehat{\boldsymbol{\theta}}^1_1, \widehat{\boldsymbol{\theta}}^2_1)\}$
2: for $t = 1, 2, \dots, T$ do
3:   Receive an instance $\mathbf{x}_t \sim \mathcal{D}_{\mathcal{X}}$
4:   $f(\mathbf{x}_t; \boldsymbol{\theta}_t) = f_1(\mathbf{x}_t; \boldsymbol{\theta}^1_t) + f_2(\phi(\mathbf{x}_t); \boldsymbol{\theta}^2_t)$
5:   $\widehat{k} = \arg\min_{k\in[K]} f(\mathbf{x}_t; \boldsymbol{\theta}_t)[k]$
6:   $k^\circ = \arg\min_{k\in([K]\setminus\{\widehat{k}\})} f(\mathbf{x}_t; \boldsymbol{\theta}_t)[k]$
7:   Predict $\mathbf{y}_{t,\widehat{k}}$
8:   $\mathbf{I}_t = \mathbb{1}\big\{|f(\mathbf{x}_t;\boldsymbol{\theta}_t)[\widehat{k}] - f(\mathbf{x}_t;\boldsymbol{\theta}_t)[k^\circ]| < 2\gamma\beta_t\big\} \in \{0,1\}$, where $\beta_t = \sqrt{\frac{2\log(3T/\delta)}{t}}$
9:   Observe $\mathbf{y}_t$ and $\ell$ if $\mathbf{I}_t = 1$; set $\mathbf{y}_t = \mathbf{y}_{t,\widehat{k}}$ otherwise
10:  $\widehat{\boldsymbol{\theta}}^1_{t+1} = \widehat{\boldsymbol{\theta}}^1_t - \eta_1\nabla_{\widehat{\boldsymbol{\theta}}^1}\mathcal{L}_{t,1}(\widehat{\boldsymbol{\theta}}^1_t)$
11:  $\widehat{\boldsymbol{\theta}}^2_{t+1} = \widehat{\boldsymbol{\theta}}^2_t - \eta_2\nabla_{\widehat{\boldsymbol{\theta}}^2}\mathcal{L}_{t,2}(\widehat{\boldsymbol{\theta}}^2_t)$
12:  $\Omega_{t+1} = \Omega_t \cup \{(\widehat{\boldsymbol{\theta}}^1_{t+1}, \widehat{\boldsymbol{\theta}}^2_{t+1})\}$
13:  Draw $(\boldsymbol{\theta}^1_{t+1}, \boldsymbol{\theta}^2_{t+1})$ uniformly from $\Omega_{t+1}$
14: end for
15: return $\boldsymbol{\theta}_T$

Exploration Network $f_2$. The exploration network $f_2$ learns the uncertainty of the estimate $f_1(\cdot\,;\boldsymbol{\theta}^1)$. In round $t \in [T]$, given an instance $\mathbf{x}_t$ and the estimate $f_1(\mathbf{x}_t;\boldsymbol{\theta}^1_t)$, the input of $f_2$ is a mapping or embedding $\phi(\mathbf{x}_t)$ that incorporates the information of both $\mathbf{x}_t$ and the discriminative information of $\boldsymbol{\theta}^1_t$. We introduce the following embedding $\phi(\mathbf{x}_t)$:

Definition 4.1 (End-to-end Embedding). Given the exploitation network $f_1(\cdot\,;\boldsymbol{\theta}^1_t)$ and an input context $\mathbf{x}_t$, its end-to-end embedding is defined as
$$\phi(\mathbf{x}_t) = \big[\sigma(\mathbf{W}^1_1\mathbf{x}_t)^\top,\ \mathrm{vec}\big(\nabla_{\mathbf{W}^1_L}f_1(\mathbf{x}_t;\boldsymbol{\theta}^1_t)\big)^\top\big]^\top \in \mathbb{R}^{m + Km}, \quad (4.2)$$
where the first term is the output of the first layer of $f_1$ and the second term is the partial derivative of $f_1$ with respect to the parameters of its last layer. $\phi(\mathbf{x}_t)$ is usually normalized.

The first term $\sigma(\mathbf{W}^1_1\mathbf{x}_t)$ can be thought of as an embedding that transforms $\mathbf{x}_t \in \mathbb{R}^d$ into another space in $\mathbb{R}^m$, as the input dimension $d$ can be very large. Thus, we leverage the representation power of $f_1$ and minimize the dependence on $d$ for $f_2$. The last layer may play the most important role in classification ability [46]; hence, we only take the gradient of the last layer, which incorporates the discriminative information of $\boldsymbol{\theta}^1_t$, to reduce the computational cost. Given the embedding $\phi(\mathbf{x}_t)$, $f_2$ is defined as:
$$f_2(\phi(\mathbf{x}_t); \boldsymbol{\theta}^2) := \sqrt{m}\,\mathbf{W}^2_L\sigma(\mathbf{W}^2_{L-1}\cdots\sigma(\mathbf{W}^2_1\phi(\mathbf{x}_t))) \in \mathbb{R}^K, \quad (4.3)$$
where $\mathbf{W}^2_1 \in \mathbb{R}^{m\times(m+Km)}$, $\mathbf{W}^2_l \in \mathbb{R}^{m\times m}$ for $2 \le l \le L-1$, $\mathbf{W}^2_L \in \mathbb{R}^{K\times m}$, $\boldsymbol{\theta}^2 = [\mathrm{vec}(\mathbf{W}^2_1)^\top, \dots, \mathrm{vec}(\mathbf{W}^2_L)^\top]^\top \in \mathbb{R}^{p_2}$, and $\sigma$ is the ReLU activation function $\sigma(x) = \max\{0, x\}$. Similarly, we randomly initialize $\boldsymbol{\theta}^2$, denoted by $\boldsymbol{\theta}^2_1$, where each entry of $\mathbf{W}^2_l$, $l \in [L-1]$, is drawn from the normal distribution $N(0, 2/m)$, and each entry of $\mathbf{W}^2_L$ is drawn from $N(0, 1/(Km))$.

In round $t$, after receiving $\mathbf{y}_t$, the label for training $f_2$ is the estimation error of $f_1(\cdot\,;\boldsymbol{\theta}^1)[k]$, represented by $\big(\ell(\mathbf{y}_{t,k},\mathbf{y}_t) - f_1(\mathbf{x}_t;\boldsymbol{\theta}^1_t)[k]\big)$. Therefore, we conduct stochastic gradient descent to update $\boldsymbol{\theta}^2$ based on the loss $\mathcal{L}_{t,2}(\boldsymbol{\theta}^2_t)$, such as $\mathcal{L}_{t,2}(\boldsymbol{\theta}^2_t) = \sum_{k\in[K]}\big(f_2(\phi(\mathbf{x}_t);\boldsymbol{\theta}^2_t)[k] - (\ell(\mathbf{y}_{t,k},\mathbf{y}_t) - f_1(\mathbf{x}_t;\boldsymbol{\theta}^1_t)[k])\big)^2/2$, where $f_2(\phi(\mathbf{x}_t);\boldsymbol{\theta}^2_t)[k]$ represents the value of the $k$th dimension of $f_2(\phi(\mathbf{x}_t);\boldsymbol{\theta}^2_t)$.

Stream-based Algorithm. Our proposed stream-based active learning algorithm is described in Algorithm 1. In each round, upon receiving a data instance, we calculate its exploitation-exploration score and then make a prediction (Lines 3-7), where $k^\circ$ is used in the query criterion (Line 8). When $\mathbf{I}_t = 1$, which indicates that the uncertainty of this instance is high, we query the label of this data instance to acquire new information; otherwise, we treat our prediction as the pseudo-label (Line 9). Finally, we use SGD to train and update the parameters of the NN models (Lines 10-13). Here, we draw the parameters from the pool $\Omega_t$ for the sake of analysis, to bound the expected approximation error of one round; one can use mini-batch SGD to avoid this step in implementation.
Pool-based Algorithm. Our proposed pool-based active learning algorithm is described in Algorithm 2.

Algorithm 2 NEURONAL-P
Input: $Q$, $B$, $K$, $f_1$, $f_2$, $\eta_1$, $\eta_2$, $\mu$, $\gamma$
1: Initialize $\boldsymbol{\theta}^1_1, \boldsymbol{\theta}^2_1$
2: for $t = 1, 2, \dots, Q$ do
3:   Draw $B$ instances, $\mathbf{P}_t$, from $\mathcal{D}_{\mathcal{X}}$
4:   for $\mathbf{x}_i \in \mathbf{P}_t$ do
5:     $f(\mathbf{x}_i;\boldsymbol{\theta}_t) = f_1(\mathbf{x}_i;\boldsymbol{\theta}^1_t) + f_2(\phi(\mathbf{x}_i);\boldsymbol{\theta}^2_t)$
6:     $\widehat{k} = \arg\min_{k\in[K]} f(\mathbf{x}_i;\boldsymbol{\theta}_t)[k]$
7:     $k^\circ = \arg\min_{k\in([K]\setminus\{\widehat{k}\})} f(\mathbf{x}_i;\boldsymbol{\theta}_t)[k]$
8:     $w_i = f(\mathbf{x}_i;\boldsymbol{\theta}_t)[k^\circ] - f(\mathbf{x}_i;\boldsymbol{\theta}_t)[\widehat{k}]$
9:   end for
10:  $\widehat{i} = \arg\min_{i\in[B]} w_i$
11:  For each $i \ne \widehat{i}$, $p_i = \frac{w_{\widehat{i}}}{\mu w_{\widehat{i}} + \gamma(w_i - w_{\widehat{i}})}$
12:  $p_{\widehat{i}} = 1 - \sum_{i\ne\widehat{i}} p_i$
13:  Draw one instance $\mathbf{x}_t$ from $\mathbf{P}_t$ according to the probability distribution $P$ formed by $\{p_i\}_{i\in[B]}$
14:  Query $\mathbf{x}_t$, predict $\mathbf{y}_{t,\widehat{k}}$, and observe $\mathbf{y}_t$
15:  Update $\boldsymbol{\theta}^1, \boldsymbol{\theta}^2$ as in Lines 10-11 of Algorithm 1
16: end for
17: return $\boldsymbol{\theta}_Q$

In each round, we calculate the exploitation and exploration scores and then compute the prediction gap $w_i$ for each data instance (Lines 4-9). Then, we form a sampling distribution $P$ over the data instances (Lines 10-13), where Lines 11-12 are inspired by the selection scheme in [2]. The intuition behind this is as follows: the prediction gap $w_i$ reflects the uncertainty of the data instance, and a smaller $w_i$ indicates larger uncertainty of $\mathbf{x}_i$; thus, the smaller $w_i$, the higher the drawing probability $p_i$ (Line 11). Finally, we draw a data instance according to the sampling distribution $P$ and conduct SGD to update the neural networks.
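To make the selection step concrete, here is a minimal sketch of the inverse-gap sampling in Lines 10-13 of Algorithm 2 (our own illustration; it assumes the gaps $w_i$ are already computed and strictly positive, with $\mu$ and $\gamma$ as in the algorithm's input):

```python
import numpy as np

def inverse_gap_sample(w: np.ndarray, mu: float, gamma: float) -> int:
    """Draw the index of one pool instance; smaller gaps w_i (higher uncertainty)
    receive larger sampling probabilities (Lines 10-13 of Algorithm 2)."""
    i_hat = int(np.argmin(w))                       # Line 10: smallest gap, most uncertain
    w_hat = w[i_hat]
    p = w_hat / (mu * w_hat + gamma * (w - w_hat))  # Line 11, for i != i_hat
    p[i_hat] = 0.0
    p[i_hat] = 1.0 - p.sum()                        # Line 12: remaining mass on i_hat
    return int(np.random.choice(len(w), p=p))       # Line 13: draw from P

# Example: with mu = gamma = 1000, most of the probability mass stays on the
# smallest-gap instance, while every other instance keeps a nonzero chance.
idx = inverse_gap_sample(np.array([0.8, 0.05, 0.3, 0.6]), mu=1000.0, gamma=1000.0)
```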
5 REGRET ANALYSIS

In this section, we provide the regret analysis for both the proposed stream-based and pool-based active learning algorithms. Existing works such as [58; 42; 70] studied active learning in binary classification, where the label space $\mathcal{Y} \subseteq [0,1]$ can be parametrized by the Mammen-Tsybakov low-noise condition [43]: there exist absolute constants $c > 0$ and $\alpha \ge 0$ such that for all $0 < \epsilon < 1/2$, $\mathbf{x} \in \mathcal{X}$, $k \in \{0, 1\}$, $\mathbb{P}\big(|\mathbf{h}(\mathbf{x})[k] - \frac{1}{2}| \le \epsilon\big) \le c\epsilon^\alpha$. For simplicity, we are interested in the two extremes: $\alpha = 0$ results in no assumption whatsoever on $\mathcal{D}$, while $\alpha = \infty$ gives the hard-margin condition on $\mathcal{D}$. These two conditions can be directly extended to the $K$-class classification task. Next, we provide the regret analysis and a comparison with [58].

Our analysis is associated with the NTK matrix $\mathbf{H}$ defined in Definition C.1. There are two complexity terms appearing in [58] as well as in neural bandits [68; 12]. The assumption $\mathbf{H} \succeq \lambda_0\mathbf{I}$ is generally made in this literature to guarantee that a solution for the NTK regression exists. The first complexity term is $S = \sqrt{\mathbf{h}^\top\mathbf{H}^{-1}\mathbf{h}}$, where $\mathbf{h} = [\mathbf{h}(\mathbf{x}_1)[1], \mathbf{h}(\mathbf{x}_1)[2], \dots, \mathbf{h}(\mathbf{x}_T)[K]]^\top \in \mathbb{R}^{TK}$. $S$ bounds the optimal parameters in NTK regression: there exists $\boldsymbol{\theta}^\ast \in \mathbb{R}^p$ such that $\mathbf{h}(\mathbf{x}_t)[k] = \langle\nabla_{\boldsymbol{\theta}}f(\mathbf{x}_t;\boldsymbol{\theta}^\ast)[k],\ \boldsymbol{\theta}^\ast - \boldsymbol{\theta}_1\rangle$ and $\|\boldsymbol{\theta}^\ast - \boldsymbol{\theta}_1\|_2 \le S$. The second complexity term is the effective dimension $\tilde{d}$, defined as $\tilde{d} = \frac{\log\det(\mathbf{I}+\mathbf{H})}{\log(1+TK)}$, which describes the actual underlying dimension of the RKHS spanned by the NTK. [58] used the term $L_{\mathbf{H}}$ to represent $\tilde{d}$: $L_{\mathbf{H}} = \log\det(\mathbf{I}+\mathbf{H}) = \tilde{d}\log(1+TK)$. We provide upper and lower bounds for $L_{\mathbf{H}}$ in Appendix C: $TK\log(1+\lambda_0) \le L_{\mathbf{H}} \le \widetilde{O}(mdK)$.

First, we present the regret and label-complexity analysis of Algorithm 1 for binary classification, which is directly comparable to [58].

Theorem 5.1 (Binary Classification). Given $T$, for any $\delta \in (0,1)$, $\lambda_0 > 0$, suppose $K = 2$, $\|\mathbf{x}_t\|_2 = 1$ for all $t \in [T]$, $\mathbf{H} \succeq \lambda_0\mathbf{I}$, $m \ge \widetilde{\Omega}(\mathrm{poly}(T, L, S)\cdot\log(1/\delta))$, and $\eta_1 = \eta_2 = \Theta\big(\frac{S}{m\sqrt{2T}}\big)$. Then, with probability at least $1-\delta$ over the initialization of $\boldsymbol{\theta}^1_1, \boldsymbol{\theta}^2_1$, Algorithm 1 achieves the following regret bound and label complexity:
$$\mathbf{R}_{stream}(T) \le \widetilde{O}\big((S^2)^{\frac{\alpha+1}{\alpha+2}}\,T^{\frac{1}{\alpha+2}}\big), \qquad \mathbf{N}(T) \le \widetilde{O}\big((S^2)^{\frac{\alpha}{\alpha+2}}\,T^{\frac{2}{\alpha+2}}\big).$$

Compared to [58], Theorem 5.1 removes the term $\widetilde{O}\big(L_{\mathbf{H}}^{\frac{2(\alpha+1)}{\alpha+2}}\,T^{\frac{1}{\alpha+2}}\big)$ from the regret and further improves the regret upper bound by a multiplicative factor $L_{\mathbf{H}}^{\frac{\alpha+1}{\alpha+2}}$. For the label complexity $\mathbf{N}(T)$, compared to [58], Theorem 5.1 removes the term $\widetilde{O}\big(L_{\mathbf{H}}^{\frac{2\alpha}{\alpha+2}}\,T^{\frac{2}{\alpha+2}}\big)$ and further improves the label-complexity bound by a multiplicative factor $L_{\mathbf{H}}^{\frac{\alpha}{\alpha+2}}$.

Next, we show the regret bound for the stream-based active learning algorithm (Algorithm 1) in $K$-class classification, without any assumption on the distribution $\mathcal{D}$ (Tsybakov noise $\alpha = 0$).

Theorem 5.2 (Stream-based). Given $T$, for any $\delta \in (0,1)$, $\lambda_0 > 0$, suppose $\|\mathbf{x}_t\|_2 = 1$ for all $t \in [T]$, $\mathbf{H} \succeq \lambda_0\mathbf{I}$, $m \ge \widetilde{\Omega}(\mathrm{poly}(T, K, L, S)\cdot\log(1/\delta))$, and $\eta_1 = \eta_2 = \Theta\big(\frac{S}{m\sqrt{TK}}\big)$. Then, with probability at least $1-\delta$ over the initialization of $\boldsymbol{\theta}^1_1, \boldsymbol{\theta}^2_1$, Algorithm 1 achieves the regret bound
$$\mathbf{R}_{stream}(T) \le O(\sqrt{TK}\,S) + O\big(\sqrt{2T\log(3T/\delta)}\big),$$
where $\mathbf{N}(T) \le O(T)$.

Theorem 5.2 shows that NEURONAL-S achieves a regret upper bound of $\widetilde{O}(\sqrt{TK}S)$. Under the same condition ($\alpha = 0$), [58] obtains a regret upper bound of $\widetilde{O}(\sqrt{TL_{\mathbf{H}}}\,S)$, whose core term is further bounded by $\widetilde{O}(\sqrt{TK\log(1+\lambda_0)}\,S) \le \widetilde{O}(\sqrt{TL_{\mathbf{H}}}\,S) \le \widetilde{O}(\sqrt{TmdK}\,S)$. Concerning $K$, Theorem 5.2 thus yields a regret-growth rate of $\sqrt{K}\cdot(\sqrt{T}S)$, in contrast with up to $\sqrt{K}\cdot(\sqrt{Tmd}\,S)$ in [58]. Therefore, our regret bound has a slower growth rate with respect to $K$ than [58] by a multiplicative factor of at least $O(\sqrt{T\log(1+\lambda_0)})$ and up to $\widetilde{O}(\sqrt{md})$. In binary classification, we reduce the regret bound by $\widetilde{O}(\sqrt{TL_{\mathbf{H}}})$, i.e., we remove the dependence on $\tilde{d}$, and further improve the regret bound by a multiplicative factor of at least $O(\sqrt{T\log(1+\lambda_0)})$.

The regret analysis of [14] introduced two other forms of complexity terms: $\widetilde{O}(\sqrt{T}(\sqrt{\mu} + \nu))$, where $\mu$ is a data-dependent term interpreted as the minimal training error of a function class on the data and $\nu$ is the radius of that function class. [3] implied that when $\nu$ has order $\widetilde{O}(\mathrm{poly}(T))$, $\mu$ can decrease to $\widetilde{O}(1)$; but their regret bound also depends on $O(\nu)$, which indicates that their regret analysis becomes vacuous when $\nu$ has order $O(T)$. Since [14] did not provide an upper bound or a trade-off for $\mu$ and $\nu$, nor a connection to the NTK, their results are not readily comparable to ours. For the label complexity, Theorem 5.2 gives the trivial $O(T)$ bound, the same as [58; 14]. This translates into the active learning minimax rate $\mathbf{N}(T)^{-1/2}$, which is indeed the best rate achievable under $\alpha = 0$ [16; 30; 36; 21]. Next, we provide the analysis under the following margin assumption.

Assumption 5.1 ([14]). Given an instance $\mathbf{x}_t$ and the label $\mathbf{y}_t$, $\mathbf{x}_t$ has a unique optimal class if there exists $\epsilon > 0$ such that for all $t \in [T]$,
$$\mathbf{h}(\mathbf{x}_t)[k^\circ] - \mathbf{h}(\mathbf{x}_t)[k^\ast] \ge \epsilon, \quad (5.1)$$
where $k^\ast = \arg\min_{k\in[K]}\mathbf{h}(\mathbf{x}_t)[k]$ and $k^\circ = \arg\min_{k\in([K]\setminus\{k^\ast\})}\mathbf{h}(\mathbf{x}_t)[k]$.

Assumption 5.1 states that each input instance has a unique Bayes-optimal class. Then, we have the following theorem.

Theorem 5.3 (Stream-based). Given $T$, for any $\delta \in (0,1)$, $\gamma > 1$, $\lambda_0 > 0$, suppose $\|\mathbf{x}_t\|_2 = 1$ for all $t \in [T]$, $\mathbf{H} \succeq \lambda_0\mathbf{I}$, $m \ge \widetilde{\Omega}(\mathrm{poly}(T, K, L, S)\cdot\log(1/\delta))$, $\eta_1 = \eta_2 = \Theta\big(\frac{S}{m\sqrt{TK}}\big)$, and Assumption 5.1 holds.
Then, with probability at least $1-\delta$ over the initialization of $\boldsymbol{\theta}^1_1, \boldsymbol{\theta}^2_1$, Algorithm 1 achieves the following regret bound and label complexity:
$$\mathbf{R}_{stream}(T) \le O\big((KS^2 + \log(3T/\delta))/\epsilon\big), \qquad \mathbf{N}(T) \le O\big((KS^2 + \log(3T/\delta))/\epsilon^2\big). \quad (5.2)$$

Theorem 5.3 provides a regret upper bound and label complexity of $\widetilde{O}(KS^2)$ for NEURONAL-S under the margin assumption. With $\alpha \to \infty$, [58] obtained $\widetilde{O}(L_{\mathbf{H}}(L_{\mathbf{H}}+S^2)) \le \widetilde{O}(mdK(mdK+S^2))$. Therefore, under the margin assumption, the regret of NEURONAL-S grows slower than that of [58] with respect to $K$ by a multiplicative factor of up to $O(md)$. [14] achieved $\widetilde{O}(\nu^2 + \mu)$, but as discussed above, their results are not directly comparable to ours.

Theorem 5.4 (Pool-based). For any $1 > \delta > 0$, $\lambda_0 > 0$, by setting $\mu = Q$ and $\gamma = \sqrt{Q/(KS^2)}$, suppose $\|\mathbf{x}_t\|_2 = 1$ for all instances, $\mathbf{H} \succeq \lambda_0\mathbf{I}$, $m \ge \widetilde{\Omega}(\mathrm{poly}(Q, K, L, R)\cdot\log(1/\delta))$, and $\eta_1 = \eta_2 = \Theta\big(\frac{1}{m\sqrt{LK}}\big)$. Then, with probability at least $1-\delta$ over the initialization of $\boldsymbol{\theta}^1_1, \boldsymbol{\theta}^2_1$, Algorithm 2 achieves
$$\mathbf{R}_{pool}(Q) \le \widetilde{O}(\sqrt{QK}\,S) + O\big(\sqrt{Q\log(2\delta^{-1})}\big) + O\big(\sqrt{Q\log(3\delta^{-1})}\big).$$

Theorem 5.4 provides a performance guarantee of $\widetilde{O}(\sqrt{QK}S)$ for NEURONAL-P in the non-parametric setting with a neural network approximator. This result shows that the pool-based algorithm can achieve a regret bound of the same complexity as the stream-based algorithm (Theorem 5.2) with respect to the number of labels. Meanwhile, Theorem 5.4 yields the $\widetilde{O}(\sqrt{1/Q})$ minimax rate in the pool-based setting, which matches the rate ($\alpha = 0$) in [8; 64; 28]. However, the results in [8; 64; 28] only hold for linear separators, i.e., they assume $\mathbf{h}$ is a linear function with respect to the data instance $\mathbf{x}$. Note that [58; 14] only work for stream-based active learning.

6 EXPERIMENTS

In this section, we evaluate NEURONAL in both the stream-based and pool-based settings on six public classification datasets: Adult, Covertype (CT), Magic Telescope (MT), Shuttle [24], Fashion [61], and Letter [18]. For all NN models, we use the same width $m = 100$ and depth $L = 2$. Due to the space limit, we move additional results (figures) to Appendix A.

Stream-based Setups. We use two metrics to evaluate the performance of each method: (1) test accuracy and (2) cumulative regret. We set $T = 10{,}000$. After $T$ rounds of active learning, we evaluate all methods on another 10,000 unseen data points and report the accuracy. As MT does not have enough data points, we use 5,000 data points for its evaluation. In each round, one instance is randomly chosen and the model predicts a label; if the predicted label is wrong, the regret is 1, and otherwise 0. The algorithm can choose to observe the label, which reduces the query budget by one. We restrict the query budget to avoid the situation where the algorithm queries every data point in the dataset, following [14]. The default label budget is $30\%$ of $T$. We ran each experiment 10 times and report the mean and standard deviation of the results. We use three baselines for comparison in the stream-based setting: (1) NeurAL-NTK [58], (2) I-NeurAL [14], and (3) ALPS [22]. Due to the space limit, we report the cumulative regret and more details in Appendix A.1.

Stream-based Results. To evaluate the effectiveness of NEURONAL-S, we first report the test accuracy of the bandit-based methods in Table 1, which shows the generalization performance of each method. From this table, we can see that the proposed NEURONAL-S consistently trains the best model compared to all baselines on each dataset, achieving non-trivial improvements under the same label budget with the same width and depth of NN models.
Compared to the bandit-based approaches, NeurAL-NTK and I-NeurAL, NEURONAL-S has new inputs and outputs, which enable us to better exploit the full feedback in active learning instead of bandit feedback, and to explore more effectively. The results of ALPS are placed in Table 3. ALPS manages to select the best pre-trained hypotheses for the data. However, its model parameters are fixed before the online active learning process; hence, ALPS is not able to take the new knowledge obtained by queries into account, and its performance is highly limited by its hypothesis class. Table 1 also shows the running time of NEURONAL-S compared to the bandit-based approaches. NEURONAL-S achieves the best empirical performance and significantly reduces the time cost; the speed-up grows when $K$ is large. This is because of NEURONAL-S's new NN model structure, which calculates the predicted scores for all classes in one pass. In contrast, the bandit-based approaches calculate the scores sequentially, so their computational cost scales with $K$. We also conduct an ablation study for different label budgets in the active learning setting, placed in Appendix A.

Table 2: Test accuracy (%) and running time of all methods in the pool-based setting.

| | Adult | MT | Letter | Covertype | Shuttle | Fashion |
|---|---|---|---|---|---|---|
| Accuracy: CoreSet | 76.7 ± 1.13 | 75.1 ± 0.79 | 80.6 ± 0.63 | 62.6 ± 3.11 | 97.7 ± 0.41 | 80.4 ± 0.08 |
| Accuracy: BADGE | 76.6 ± 0.49 | 71.6 ± 0.81 | 81.7 ± 0.57 | 64.8 ± 1.02 | 98.6 ± 0.39 | 76.1 ± 0.21 |
| Accuracy: DynamicAL | 72.4 ± 0.14 | 67.8 ± 1.01 | 63.2 ± 0.31 | 54.1 ± 0.12 | 78.7 ± 0.05 | 54.5 ± 0.19 |
| Accuracy: ALBL | 78.1 ± 0.45 | 73.9 ± 0.71 | 81.9 ± 0.47 | 65.3 ± 0.14 | 98.6 ± 0.37 | 77.6 ± 0.32 |
| Accuracy: NEURONAL-P | 79.1 ± 0.04 | 81.3 ± 0.12 | 83.7 ± 0.07 | 67.6 ± 0.21 | 99.5 ± 0.01 | 81.1 ± 0.13 |
| Running time: CoreSet | 43.1 ± 7.65 | 119.3 ± 4.42 | 228.3 ± 6.51 | 32.5 ± 10.94 | 10.9 ± 2.22 | 33.1 ± 6.32 |
| Running time: BADGE | 159.5 ± 4.61 | 212.5 ± 9.32 | 484.8 ± 7.04 | 545.7 ± 9.32 | 222.9 ± 5.13 | 437.8 ± 5.32 |
| Running time: DynamicAL | 24.3 ± 5.21 | 821.5 ± 6.14 | 382.3 ± 3.13 | 621.6 ± 3.21 | 483.4 ± 9.78 | 413.2 ± 7.14 |
| Running time: ALBL | 315.8 ± 4.31 | 343.5 ± 6.24 | 271.3 ± 6.32 | 481.3 ± 5.21 | 63.2 ± 2.16 | 92.1 ± 3.42 |
| Running time: NEURONAL-P | 17.2 ± 3.24 | 140.1 ± 3.69 | 133.7 ± 12.8 | 14.1 ± 5.81 | 15.6 ± 8.03 | 25.5 ± 7.80 |

Pool-based Setups. We use test accuracy to evaluate the performance of each method. Following [57], we use batch active learning: in each active learning round, we select and query 100 points for labels. After 10 rounds of active learning, we evaluate all methods on another 10,000 unseen data points and report the accuracy. We run each experiment 10 times and report the mean and standard deviation of the results. The four SOTA baselines are: (1) CoreSet [52], (2) BADGE [6], (3) DynamicAL [57], and (4) ALBL [33].

Pool-based Results. To evaluate the effectiveness of NEURONAL-P, we report the test accuracy of all methods in Table 2. NEURONAL-P consistently outperforms the other baselines across all datasets. This indicates that baselines without explicit exploration are more easily trapped in sub-optimal solutions. Our method has a more effective network structure (including the principled exploration network), allowing us to exploit the full feedback and explore new knowledge simultaneously in active learning. Moreover, we use the inverse-gap strategy to draw samples, which further balances exploitation and exploration. CoreSet always chooses the maximal distance based on the embeddings derived from the last layer of the exploitation neural network, which makes it prone to being affected by outlier samples. BADGE also works on the last layer of the exploitation network, using a seed-based clustering method, and is not adaptive to the state of the exploitation network.
DynamicAL relies on the training dynamics of the Neural Tangent Kernel, but the rules it uses may only hold for over-parameterized neural networks, according to its analysis. ALBL is a hybrid active learning algorithm that combines CoreSet and Conf [55]; it shows stronger performance but is still outperformed by our algorithm. As mentioned before, these baselines do not provide a theoretical performance guarantee. For the running time, as there are $B$ samples in the pool in each round, NEURONAL-P takes $O(2B)$ forward passes to make a selection. Among the baselines, CoreSet also takes $O(2B)$. In addition to the $O(2B)$ cost, BADGE and DynamicAL need to calculate the gradient for each sample, which scales with $B$; thus, BADGE and DynamicAL are slow. ALBL is slow because it runs another algorithm internally and thus incurs two computational costs in addition to the neural models. We also conduct a hyper-parameter sensitivity study for different label budgets in the active learning setting, placed in Appendix A.2.

7 CONCLUSION

In this paper, we propose two algorithms for stream-based and pool-based active learning. The proposed algorithms mitigate the adverse effects of $K$ in terms of both computational cost and performance. They build on the newly designed exploitation and exploration neural networks, which enjoy a tighter provable performance guarantee in the non-parametric setting. Finally, we use extensive experiments to demonstrate the improved empirical performance in the stream- and pool-based settings.

ACKNOWLEDGEMENT

This work is supported by the National Science Foundation under Award No. IIS-2117902, IIS-2002540, and IIS-2134079; Agriculture and Food Research Initiative (AFRI) grant no. 2020-67021-32799/project accession no. 1024178 from the USDA National Institute of Food and Agriculture; MIT-IBM Watson AI Lab; and the IBM-Illinois Discovery Accelerator Institute, a new model of an academic-industry partnership designed to increase access to technology education and skill development to spur breakthroughs in emerging areas of technology. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.

REFERENCES

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.

[2] Naoki Abe and Philip M. Long. Associative reinforcement learning using linear probabilistic concepts. In ICML, pp. 3–11. Citeseer, 1999.

[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.

[4] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R. Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pp. 8141–8150, 2019.

[5] Jordan Ash, Surbhi Goel, Akshay Krishnamurthy, and Sham Kakade. Gone fishing: Neural active learning with Fisher embeddings. Advances in Neural Information Processing Systems, 34, 2021.

[6] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671, 2019.

[7] Pranjal Awasthi, Maria Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise.
In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pp. 449–458, 2014.

[8] Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pp. 288–316. PMLR, 2013.

[9] Yikun Ban and Jingrui He. Generic outlier detection in multi-armed bandit. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 913–923, 2020.

[10] Yikun Ban and Jingrui He. Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, pp. 2335–2346, 2021.

[11] Yikun Ban, Jingrui He, and Curtiss B. Cook. Multi-facet contextual bandits: A neural network perspective. In The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, pp. 35–45, 2021.

[12] Yikun Ban, Yunzhe Qi, Tianxin Wei, and Jingrui He. Neural collaborative filtering bandits via meta learning. ArXiv abs/2201.13395, 2022.

[13] Yikun Ban, Yuchen Yan, Arindam Banerjee, and Jingrui He. EE-Net: Exploitation-exploration neural networks in contextual bandits. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=X_ch3VrNSRg.

[14] Yikun Ban, Yuheng Zhang, Hanghang Tong, Arindam Banerjee, and Jingrui He. Improved algorithms for neural active learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=riIaC2ivcYA.

[15] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32:10836–10846, 2019.

[16] Rui M. Castro and Robert D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.

[17] Gui Citovsky, Giulia DeSalvo, Claudio Gentile, Lazaros Karydas, Anand Rajagopalan, Afshin Rostamizadeh, and Sanjiv Kumar. Batch active learning at scale. Advances in Neural Information Processing Systems, 34, 2021.

[18] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017.

[19] David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.

[20] Sanjoy Dasgupta, Daniel J. Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. Advances in Neural Information Processing Systems, 20, 2007.

[21] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. The Journal of Machine Learning Research, 13(1):2655–2697, 2012.

[22] Giulia DeSalvo, Claudio Gentile, and Tobias Sommer Thune. Online active learning with surrogate loss functions. Advances in Neural Information Processing Systems, 34, 2021.

[23] Bo Du, Zengmao Wang, Lefei Zhang, Liangpei Zhang, Wei Liu, Jialie Shen, and Dacheng Tao. Exploring representativeness and informativeness for active learning. IEEE Transactions on Cybernetics, 47(1):14–26, 2015.

[24] Dheeru Dua, Casey Graff, et al. UCI Machine Learning Repository. 2017.

[25] Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp.
3199–3210. PMLR, 2020.

[26] Dongqi Fu, Dawei Zhou, and Jingrui He. Local motif clustering on time-evolving graphs. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pp. 390–400. ACM, 2020.

[27] Dongqi Fu, Dawei Zhou, Ross Maciejewski, Arie Croitoru, Marcus Boyd, and Jingrui He. Fairness-aware clique-preserving spectral clustering of temporal graphs. In Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023, pp. 3755–3765. ACM, 2023.

[28] Claudio Gentile, Zhilei Wang, and Tong Zhang. Achieving minimax rates in pool-based batch active learning. In International Conference on Machine Learning, pp. 7339–7367. PMLR, 2022.

[29] Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, pp. 353–360, 2007.

[30] Steve Hanneke. Adaptive rates of convergence in active learning. In COLT. Citeseer, 2009.

[31] Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

[32] Apivich Hemachandra, Zhongxiang Dai, Jasraj Singh, See-Kiong Ng, and Bryan Kian Hsiang Low. Training-free neural active learning with initialization-robustness guarantees. In International Conference on Machine Learning, pp. 12931–12971. PMLR, 2023.

[33] Wei-Ning Hsu and Hsuan-Tien Lin. Active learning by learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

[34] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.

[35] Yoon-Yeong Kim, Kyungwoo Song, Joon Ho Jang, and Il-chul Moon. LADA: Look-ahead data acquisition via augmentation for deep active learning. Advances in Neural Information Processing Systems, 34, 2021.

[36] Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. The Journal of Machine Learning Research, 11:2457–2485, 2010.

[37] Aryeh Kontorovich, Sivan Sabato, and Ruth Urner. Active nearest-neighbor learning in metric spaces. Advances in Neural Information Processing Systems, 29, 2016.

[38] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.

[39] Lihui Liu, Boxin Du, Jiejun Xu, Yinglong Xia, and Hanghang Tong. Joint knowledge graph completion and question answering. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1098–1108, 2022.

[40] Lihui Liu, Blaine Hill, Boxin Du, Fei Wang, and Hanghang Tong. Conversational question answering with reformulations over knowledge graph. arXiv preprint arXiv:2312.17269, 2023.

[41] Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, and Conghui He. Influence selection for active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9274–9283, 2021.

[42] Andrea Locatelli, Alexandra Carpentier, and Samory Kpotufe. Adaptivity to noise parameters in nonparametric active learning. In Proceedings of the 2017 Conference on Learning Theory, PMLR, 2017.

[43] Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
[44] Stanislav Minsker. Plug-in approach to active learning. Journal of Machine Learning Research, 13(1), 2012.

[45] Jooyoung Moon, Jihyo Kim, Younghak Shin, and Sangheum Hwang. Confidence-aware learning for deep neural networks. In International Conference on Machine Learning, pp. 7034–7044. PMLR, 2020.

[46] Eliya Nachmani, Yair Be'ery, and David Burshtein. Learning to decode linear codes using deep learning. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 341–346. IEEE, 2016.

[47] Aldo Pacchiano, My Phan, Yasin Abbasi Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, and Csaba Szepesvari. Model selection in contextual stochastic bandit problems. Advances in Neural Information Processing Systems, 33:10328–10337, 2020.

[48] Remus Pop and Patric Fulop. Deep ensemble Bayesian active learning: Addressing the mode collapse issue in Monte Carlo dropout via ensembles. arXiv preprint arXiv:1811.03897, 2018.

[49] Yunzhe Qi, Yikun Ban, and Jingrui He. Graph neural bandits. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1920–1931, 2023.

[50] Yunzhe Qi, Yikun Ban, Tianxin Wei, Jiaru Zou, Huaxiu Yao, and Jingrui He. Meta-learning with neural bandit scheduler. Advances in Neural Information Processing Systems, 36, 2024.

[51] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin Wang. A survey of deep active learning. ACM Computing Surveys (CSUR), 54(9):1–40, 2021.

[52] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.

[53] Burr Settles. Active learning literature survey. 2009.

[54] Wei Tan, Lan Du, and Wray Buntine. Diversity enhanced active learning with strictly proper scoring rules. Advances in Neural Information Processing Systems, 34, 2021.

[55] Dan Wang and Yi Shang. A new active labeling method for deep learning. In 2014 International Joint Conference on Neural Networks (IJCNN), pp. 112–119. IEEE, 2014.

[56] Haonan Wang, Wei Huang, Andrew Margenot, Hanghang Tong, and Jingrui He. Deep active learning by leveraging training dynamics. arXiv preprint arXiv:2110.08611, 2021.

[57] Haonan Wang, Wei Huang, Ziwei Wu, Hanghang Tong, Andrew J. Margenot, and Jingrui He. Deep active learning by leveraging training dynamics. Advances in Neural Information Processing Systems, 35:25171–25184, 2022.

[58] Zhilei Wang, Pranjal Awasthi, Christoph Dann, Ayush Sekhari, and Claudio Gentile. Neural active learning with performance guarantees. Advances in Neural Information Processing Systems, 34, 2021.

[59] Tianxin Wei and Jingrui He. Comprehensive fair meta-learned recommender system. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1989–1999, 2022.

[60] Tianxin Wei, Bowen Jin, Ruirui Li, Hansi Zeng, Zhengyang Wang, Jianhui Sun, Qingyu Yin, Hanqing Lu, Suhang Wang, Jingrui He, et al. Towards universal multi-modal personalization: A language model empowered generative paradigm. In The Twelfth International Conference on Learning Representations, 2023.

[61] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. 2017.

[62] Donggeun Yoo and In So Kweon. Learning loss for active learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 93–102, 2019.

[63] Chicheng Zhang.
Efficient active learning of sparse halfspaces. In Conference on Learning Theory, pp. 1856–1880. PMLR, 2018.

[64] Chicheng Zhang and Yinan Li. Improved algorithms for efficient active learning halfspaces with Massart and Tsybakov noise. In Conference on Learning Theory, pp. 4526–4527. PMLR, 2021.

[65] Chicheng Zhang, Jie Shen, and Pranjal Awasthi. Efficient active learning of sparse halfspaces with arbitrary bounded noise. Advances in Neural Information Processing Systems, 33:7184–7197, 2020.

[66] Yuheng Zhang, Hanghang Tong, Yinglong Xia, Yan Zhu, Yuejie Chi, and Lei Ying. Batch active learning with graph neural networks via multi-agent deep reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 36:9118–9126, 2022.

[67] Fedor Zhdanov. Diverse mini-batch active learning. arXiv preprint arXiv:1901.05954, 2019.

[68] Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with UCB-based exploration. In International Conference on Machine Learning, pp. 11492–11502. PMLR, 2020.

[69] Jingbo Zhu and Matthew Ma. Uncertainty-based active learning with instability estimation for text classification. ACM Transactions on Speech and Language Processing (TSLP), 8(4):1–21, 2012.

[70] Yinglun Zhu and Robert D. Nowak. Active learning with neural networks: Insights from nonparametric statistics. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=LRMmgkcoCnW.

A EXPERIMENT DETAILS

In this section, we provide more experiments and details for both the stream-based and pool-based settings.

Figure 1: Regret comparison on six datasets in the stream-based setting. NEURONAL-S outperforms the baselines on most datasets.

A.1 STREAM-BASED

Table 3: Test accuracy of all methods in the stream-based setting.

| | Adult | MT | Letter | Covertype | Shuttle | Fashion |
|---|---|---|---|---|---|---|
| NeurAL-NTK [58] | 80.1 ± 0.06 | 76.9 ± 0.1 | 79.6 ± 0.23 | 62.1 ± 0.02 | 96.2 ± 0.21 | 64.3 ± 0.13 |
| I-NeurAL | 84.1 ± 0.29 | 79.9 ± 0.14 | 82.9 ± 0.04 | 65.2 ± 0.16 | 99.3 ± 0.07 | 73.6 ± 0.29 |
| ALPS | 75.6 ± 0.19 | 35.9 ± 0.76 | 73.0 ± 0.41 | 36.2 ± 0.25 | 78.5 ± 0.21 | 74.3 ± 0.01 |
| NEURONAL-S | 84.7 ± 0.32 | 83.7 ± 0.16 | 86.8 ± 0.12 | 74.5 ± 0.15 | 99.7 ± 0.02 | 82.2 ± 0.18 |

Baselines. Given an instance, NeurAL-NTK is a method that predicts $K$ scores for the $K$ classes sequentially, based only on an exploitation NN classifier with an upper confidence bound (UCB). In contrast, I-NeurAL predicts the $K$ scores sequentially based on both an exploitation and an exploration NN classifier. NeurAL-NTK and I-NeurAL query for a label when the model cannot differentiate the Bayes-optimal class from the other classes. Finally, ALPS makes a prediction by choosing a hypothesis (from a set of pre-trained hypotheses) that minimizes the loss on labeled and pseudo-labeled data, and queries based on the differences between hypotheses.

Implementation Details. We perform hyper-parameter tuning on the training set. Each method has several hyper-parameters: the learning rate, number of epochs, batch size, label budget percentage, and threshold (if applicable). During hyper-parameter tuning for all methods, we perform a grid search over {0.0001, 0.0005, 0.001} for the learning rate, {10, 20, 30, 40, 50, 60, 70, 80, 90} for the number of epochs, {32, 64, 128, 256} for the batch size, {0.1, 0.3, 0.5, 0.7, 0.9} for the label budget percentage, and {1, 2, 3, 4, 5, 6, 7, 8, 9} for the threshold (exploration) parameter.
The final hyper-parameters, in the form {learning rate, number of epochs, batch size, label budget percentage, threshold parameter}, are as follows: NeurAL-NTK and ALPS use {0.001, 40, 64, 0.3, 3}, and I-NeurAL uses {0.001, 40, 32, 0.3, 6}. For NEURONAL-S, we set $\mu = 1$ for all experiments and conduct a grid search for $\gamma$ over {1, 2, 3, 4, 5, 6, 7, 8, 9}; in the end, NEURONAL-S uses {0.0001, 40, 64, 0.3, 6} for all datasets. We set $S = 1$ as the norm parameter for NEURONAL-S throughout.

Cumulative Regret. Figure 1 shows the regret comparison on each of the six datasets over 10,000 rounds of stream-based active learning. NEURONAL-S outperforms the baselines on most datasets. This demonstrates that our designed NN model is effective in the active learning process, consistent with the best testing accuracy achieved by NEURONAL-S.

Ablation study for label budgets. Tables 4 to 6 show NEURONAL-S in active learning with different budget percentages: 3%, 10%, and 50%. NEURONAL-S achieves the best performance in most of the experiments. With a 3% label budget, almost none of the NN models are well trained, so NEURONAL-S does not perform stably. With 10% and 50% label budgets, NEURONAL-S achieves better performance, because the advantages of its NN models allow it to better exploit and explore the label information.

Table 4: Test accuracy with 3% budget in the stream-based setting.

| | Adult | Covertype | Fashion | Magic Telescope | Letter | Shuttle |
|---|---|---|---|---|---|---|
| I-NeurAL | 79.4% | 52.8% | 51.9% | 72.3% | 74.6% | 93.0% |
| NeurAL-NTK | 23.9% | 1.56% | 11.9% | 32.9% | 42.8% | 70.6% |
| ALPS | 24.2% | 36.8% | 10.0% | 64.9% | 72.7% | 79.4% |
| NEURONAL-S | 79.9% | 65.6% | 69.7% | 77.3% | 74.2% | 99.8% |

Table 5: Test accuracy with 10% budget in the stream-based setting.

| | Adult | Covertype | Fashion | Magic Telescope | Letter | Shuttle |
|---|---|---|---|---|---|---|
| I-NeurAL | 80.5% | 55.4% | 71.4% | 77.9% | 81.8% | 99.2% |
| NeurAL-NTK | 70.5% | 59.9% | 38.7% | 34.3% | 53.8% | 75.9% |
| ALPS | 24.2% | 36.8% | 10.0% | 35.1% | 79.9% | 79.4% |
| NEURONAL-S | 79.5% | 71.3% | 81.3% | 82.1% | 81.8% | 99.8% |

Table 6: Test accuracy with 50% budget in the stream-based setting.

| | Adult | Covertype | Fashion | Magic Telescope | Letter | Shuttle |
|---|---|---|---|---|---|---|
| I-NeurAL | 83.4% | 65.9% | 82.5% | 77.9% | 85.8% | 99.7% |
| NeurAL-NTK | 76.9% | 73.1% | 56.8% | 81.6% | 79.3% | 97.1% |
| ALPS | 75.8% | 36.8% | 10.0% | 64.9% | 81.5% | 79.4% |
| NEURONAL-S | 84.6% | 75.9% | 85.4% | 86.4% | 86.9% | 99.8% |

A.2 POOL-BASED

Baselines. BADGE uses gradient embeddings to model uncertainty: if the gradient in the last neural network layer is large, the uncertainty is also large. It picks random points to query using the k-means++ algorithm and repeats this process. DynamicAL introduces the concept of training dynamics into active learning: given an instance, it assigns a pseudo-label and monitors how much the model changes, querying the label of the point with the biggest change in the training dynamics. CoreSet takes a simple but powerful approach: it chooses points to query based on the loss over the labeled points and the loss over the unlabeled points, i.e., the core-set loss. ALBL is a hybrid active learning algorithm that combines CoreSet and Conf [55].

Implementation details. For all methods, we conduct the same grid search as in the stream-based setting for the learning rate and number of epochs. For NEURONAL-P, we perform a grid search over $\mu, \gamma \in \{500, 1000, 2000\}$.

Testing accuracy. Figure 2 shows the average test accuracy in each round. NEURONAL-P outperforms the baselines within a few query rounds.
Our method utilizes a highly effective network structure, including the principled exploration network and the inverse-weight-based selection strategy. Unlike CoreSet, which relies solely on the embeddings derived from the last layer of the exploitation neural network to select samples by maximum distance, our approach avoids susceptibility to outlier samples. Similarly, BADGE also operates on the last layer of the exploitation network using a seed-based clustering method, lacking adaptability to the state of the exploitation network. DynamicAL relies on the training dynamics of the Neural Tangent Kernel, which usually requires very wide neural networks. ALBL is a blending approach, but it still suffers from the limitations of CoreSet.

Figure 2: Test accuracy versus the number of query rounds in the pool-based setting on six datasets. NEURONAL-P outperforms the baselines on all datasets.

Ablation study for $\mu$ and $\gamma$. Table 7 shows NEURONAL-P with varying $\mu$ and $\gamma$ values (500, 1000, 2000) on four datasets. Intuitively, if $\gamma$ and $\mu$ are too small, NEURONAL-P places more weight on the tail of the distribution $P$; otherwise, NEURONAL-P focuses more on the head of the distribution $P$. The results suggest that different datasets respond to different values of $\mu$ and $\gamma$; this sensitivity study roughly indicates good values for $\mu$ and $\gamma$.

Table 7: Testing accuracy on four datasets (Letter, Adult, Fashion, and MT) with varying $\mu$ (rows) and $\gamma$ (columns) in the pool-based setting.

Letter:
| µ \ γ | 500 | 1000 | 2000 |
|---|---|---|---|
| 500 | 80.9% | 81.7% | 80.5% |
| 1000 | 77.9% | 83.9% | 78.9% |
| 2000 | 81.7% | 81.8% | 80.1% |

Adult:
| µ \ γ | 500 | 1000 | 2000 |
|---|---|---|---|
| 500 | 79.9% | 79.4% | 78.9% |
| 1000 | 79.1% | 79.7% | 79.0% |
| 2000 | 79.4% | 79.4% | 79.7% |

Fashion:
| µ \ γ | 500 | 1000 | 2000 |
|---|---|---|---|
| 500 | 80.3% | 80.5% | 79.5% |
| 1000 | 80.5% | 80.6% | 80.4% |
| 2000 | 80.8% | 80.9% | 80.7% |

MT:
| µ \ γ | 500 | 1000 | 2000 |
|---|---|---|---|
| 500 | 79.5% | 80.9% | 80.6% |
| 1000 | 80.2% | 80.9% | 80.1% |
| 2000 | 80.5% | 80.6% | 81.3% |

B STREAM-BASED VS. POOL-BASED

To answer the question "Can one directly convert a stream-based algorithm to the pool-based setting?", we implemented the idea of uniformly sampling data from the pool and feeding it to the stream-based active learner. Denote the new algorithm by Neu-UniS, described as follows.

Neu-UniS: In each round of pool-based active learning, we uniformly draw a sample from the pool and feed it to the stream-based learner. If the stream-based learner decides to query the label, it costs one unit of the label budget; otherwise, we keep uniformly drawing samples until the stream-based learner decides to query a label. Once the label is queried, we train the neural model on the sample and label. We repeat this process until we run out of the label budget. In this algorithm, the stream-based learner is set to NEURONAL-S (Algorithm 1 in the manuscript).

Under the same setting used in our pool-based experiments, the test accuracy of Neu-UniS compared to our pool-based algorithm NEURONAL-P is reported in Table 8.

Table 8: Test accuracy of Neu-UniS and NEURONAL-P.

| | Adult | Covertype | Fashion | MT | Letter | Shuttle |
|---|---|---|---|---|---|---|
| Neu-UniS | 78.0 | 59.4 | 78.6 | 71.3 | 79.4 | 97.9 |
| NEURONAL-P | 79.9 | 66.9 | 80.9 | 81.3 | 83.9 | 99.6 |

Why does NEURONAL-P outperform Neu-UniS? Because Neu-UniS does not rank data instances and only randomly chooses among the data instances that satisfy the query criterion. All stream-based algorithms have one criterion to determine whether to query the label of a data point, such as Lines 8-9 in Algorithm 1. Suppose there are 200 data points.
To the best of our knowledge, this is the first inverse-gap weighting strategy in active learning, and its analysis is accordingly novel. In summary, stream-based algorithms cannot be directly converted into pool-based algorithms, because they lack the ranking component that the pool-based setting requires. Existing works [58; 22; 14] focus only on the stream-based setting, while [52; 6; 57] focus only on the pool-based setting. We could hardly find existing works that cover both settings.

C UPPER BOUND AND LOWER BOUND FOR $L_{\mathbf{H}}$

Definition C.1 (NTK [34; 58]). Let $N$ denote the normal distribution. Given the data instances $\{x_t\}_{t=1}^T$, for all $i, j \in [T]$, define
\[
H^0_{i,j} = \Sigma^0_{i,j} = \langle x_i, x_j\rangle, \qquad
A^{l}_{i,j} = \begin{pmatrix} \Sigma^{l}_{i,i} & \Sigma^{l}_{i,j} \\ \Sigma^{l}_{j,i} & \Sigma^{l}_{j,j} \end{pmatrix},
\]
\[
\Sigma^{l}_{i,j} = 2\,\mathbb{E}_{(a,b)\sim N(0, A^{l-1}_{i,j})}\big[\sigma(a)\,\sigma(b)\big], \qquad
H^{l}_{i,j} = 2\,H^{l-1}_{i,j}\,\mathbb{E}_{(a,b)\sim N(0, A^{l-1}_{i,j})}\big[\sigma'(a)\,\sigma'(b)\big] + \Sigma^{l}_{i,j}.
\]
Then, the Neural Tangent Kernel matrix is defined as $\mathbf{H} = (H^L + \Sigma^L)/2$.

Next, we define the following Gram matrix $G$. Let $g(x;\theta_0) = \nabla_\theta f(x;\theta_0) \in \mathbb{R}^p$ and
\[
\bar G = \big[\,g(x_{1,1};\theta_0)/\sqrt{m}, \dots, g(x_{T,K};\theta_0)/\sqrt{m}\,\big] \in \mathbb{R}^{p\times TK},
\]
where $p = m + mKd + m^2(L-2)$, and set $G = \bar G^\top \bar G$. Based on Theorem 3.1 in [4], when $m \ge \Omega(T^4K^6\log(2TK/\delta)/\epsilon^4)$, with probability at least $1-\delta$, we have $\|\mathbf{H} - G\|_F \le \epsilon$. Then, we have the following bound:
\[
\log\det(\mathbf{I}+\mathbf{H}) = \log\det\big(\mathbf{I} + G + (\mathbf{H}-G)\big)
\overset{(e_1)}{\le} \log\det(\mathbf{I}+G) + \big\langle (\mathbf{I}+G)^{-1}, \mathbf{H}-G \big\rangle
\le \log\det(\mathbf{I}+G) + \|(\mathbf{I}+G)^{-1}\|_F\,\|\mathbf{H}-G\|_F
\overset{(e_2)}{\le} \log\det(\mathbf{I}+G) + \sqrt{TK}\,\|\mathbf{H}-G\|_F
\overset{(e_3)}{\le} \log\det(\mathbf{I}+G) + 1,
\]
where $(e_1)$ is because of the concavity of $\log\det(\cdot)$, $(e_2)$ is by Lemma B.1 in [68] with the choice of $m$, and $(e_3)$ is by the proper choice of $\epsilon$. Then, $L_{\mathbf{H}}$ can be bounded by:
\[
\log\det(\mathbf{I}+\mathbf{H}) \le \log\det(\mathbf{I}+G) + 1
\overset{(e_1)}{=} \log\det(\mathbf{I} + \bar G\bar G^\top) + 1
= \log\det\Big(\mathbf{I} + \sum_{i=1}^{TK} g(x_i;\theta_0)\,g(x_i;\theta_0)^\top/m\Big) + 1
\overset{(e_2)}{\le} p\,\log\big(1 + O(TK)/p\big) + 1,
\]
where $(e_1)$ is because $\det(\mathbf{I} + \bar G^\top\bar G) = \det(\mathbf{I} + \bar G\bar G^\top)$ and $(e_2)$ is an application of Lemma 10 in [1] together with $\|g(x_i;\theta_0)\|_2 \le O(\sqrt{mL})$ with $L=2$. Because $p = m + mKd + m^2(L-2)$, we have
\[
L_{\mathbf{H}} = \log\det(\mathbf{I}+\mathbf{H}) \le \widetilde{O}(mKd). \tag{C.2}
\]
For the lower bound of $L_{\mathbf{H}}$, we have
\[
L_{\mathbf{H}} = \log\det(\mathbf{I}+\mathbf{H}) \ge \log \lambda_{\min}(\mathbf{I}+\mathbf{H})^{TK} \ge TK\log(1+\lambda_0). \tag{C.3}
\]
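For intuition, the recursion in Definition C.1 has a closed form for ReLU activations via the standard arc-cosine identities $2\,\mathbb{E}[\sigma(a)\sigma(b)] = \frac{\sqrt{\Sigma_{ii}\Sigma_{jj}}}{\pi}\big(\sin\lambda + (\pi-\lambda)\cos\lambda\big)$ and $2\,\mathbb{E}[\sigma'(a)\sigma'(b)] = \frac{\pi-\lambda}{\pi}$, where $\lambda$ is the angle induced by $\Sigma^{l-1}$. The sketch below implements these identities; it is an illustrative implementation, and the depth and normalization conventions should be checked against Definition C.1.

```python
import numpy as np

def ntk_matrix(X, L=2):
    """NTK of Definition C.1 for ReLU networks via arc-cosine identities.
    X: (T, d) array of unit-norm rows; returns H = (H^L + Sigma^L) / 2."""
    S = X @ X.T                      # Sigma^0 = H^0 = <x_i, x_j>
    H = S.copy()
    for _ in range(L):
        d = np.sqrt(np.outer(np.diag(S), np.diag(S)))
        c = np.clip(S / d, -1.0, 1.0)
        lam = np.arccos(c)           # angle of the bivariate Gaussian
        S_new = (d / np.pi) * (np.sin(lam) + (np.pi - lam) * c)  # Sigma^l
        H = H * (np.pi - lam) / np.pi + S_new                    # H^l
        S = S_new
    return (H + S) / 2.0

X = np.random.randn(5, 8)
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(np.linalg.eigvalsh(ntk_matrix(X)).min())  # lambda_0 > 0 in general
```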
D PROOF OF THEOREM 5.2 AND THEOREM 5.3

First, define the general neural structure:
\[
f(x_t;\theta) := \sqrt{m}\, W_L\,\sigma(W_{L-1}\cdots \sigma(W_1 x_t)) \in \mathbb{R}^K, \tag{D.1}
\]
where $\theta = [\mathrm{vec}(W_1)^\top, \dots, \mathrm{vec}(W_L)^\top]^\top \in \mathbb{R}^p$. Following [3; 15], given an instance $x_t$, we define the outputs of the hidden layers of the neural network:
\[
g_{t,0} = x_t, \qquad g_{t,l} = \sigma(W_l\, g_{t,l-1}), \quad l \in [L-1].
\]
Then, we define the binary diagonal matrix functioning as the ReLU:
\[
D_{t,l} = \mathrm{diag}\big(\mathbb{1}\{(W_l g_{t,l-1})_1 > 0\}, \dots, \mathbb{1}\{(W_l g_{t,l-1})_m > 0\}\big), \quad l \in [L-1].
\]
Accordingly, the neural network is represented by $f(x_t;\theta) = \sqrt{m}\, W_L \big(\prod_{l=1}^{L-1} D_{t,l} W_l\big)\, x_t$, and we have the following gradient form:
\[
\nabla_{W_l} f(x_t;\theta) =
\begin{cases}
\sqrt{m}\,\Big[g_{t,l-1}\, W_L \Big(\prod_{\tau=l+1}^{L-1} D_{t,\tau} W_\tau\Big) D_{t,l}\Big]^\top, & l \in [L-1], \\[4pt]
\sqrt{m}\, g_{t,L-1}^\top, & l = L.
\end{cases}
\]
Let $\theta_1$ be the random initialization of the parameters of the neural networks. Then, we define the following neural network function class: given a constant $R > 0$,
\[
\mathcal{B}(\theta_1, R) = \big\{\theta \in \mathbb{R}^p : \|\theta - \theta_1\|_2 \le R/m^{1/2}\big\}. \tag{D.2}
\]

Lemma D.1 ([3]). With probability at least $1 - O(TL)\exp[-\Omega(m\omega^{2/3}L)]$, given $\omega \le O(L^{-9/2}[\log m]^{-3/2})$, for all $\theta, \theta' \in \mathcal{B}(\theta_1, R)$, $i \in [T]$, $l \in [L-1]$,
\[
\|D_{i,l} - D'_{i,l}\|_0 \le O(L\,\omega^{2/3}\, m).
\]

Lemma D.2. With probability at least $1 - O(TL^2)\exp[-\Omega(m\omega^{2/3}L)]$, uniformly over any diagonal matrices $D''_{i,1}, \dots, D''_{i,L-1} \in [-1,1]^{m\times m}$ with at most $O(m\omega^{2/3}L)$ non-zero entries, for any $\theta, \theta' \in \mathcal{B}(\theta_1;\omega)$ with $\omega \le O(L^{-6}[\log m]^{-3/2})$, the following perturbation bounds hold (the exact constants follow Lemmas 8.6 and 8.7 in [3]):
\[
\Big\|\prod_{\tau=l_1}^{l_2}(D_{i,\tau}+D''_{i,\tau})W_\tau\Big\|_2 \le O(\sqrt{L}), \tag{D.3}
\]
\[
\Big\|W_L\prod_{\tau=l_1}^{l_2}(D'_{i,\tau}+D''_{i,\tau})W'_\tau - W_L^{(1)}\prod_{\tau=l_1}^{l_2}D^{(1)}_{i,\tau}W^{(1)}_\tau\Big\|_2 \le O\big(\omega^{1/3}L^2\sqrt{m\log m}\big), \tag{D.4}
\]
\[
\Big\|W_L\prod_{\tau=l_1}^{l_2}(D_{i,\tau}+D''_{i,\tau})W_\tau\Big\|_2 \le O(\sqrt{m}). \tag{D.5}
\]
Proof. Based on Lemma D.1, with probability at least $1 - O(TL)\exp[-\Omega(L\omega^{2/3}m)]$,
\[
\|D_{i,l} + D''_{i,l} - D^{(1)}_{i,l}\|_0 \le O(L\,\omega^{2/3}\,m).
\]
Applying Lemma 8.6 in [3] proves (D.3). Then, applying Lemma 8.7 in [3] with $s = O(m\omega^{2/3}L)$ to $W$ and $W'$ yields
\[
\Big\|W'_L\prod_{\tau=l_1}^{l_2}(D'_{i,\tau}+D''_{i,\tau})W'_\tau - W^{(1)}_L\prod_{\tau=l_1}^{l_2}D^{(1)}_{i,\tau}W^{(1)}_\tau\Big\|_2 \le O\big(\omega^{1/3}L^2\sqrt{m\log m}\big). \tag{D.6}
\]
Moreover, using Lemma D.1 again, the difference induced by replacing $W'_L$ with $W_L$ satisfies
\[
\Big\|(W'_L - W^{(1)}_L)\prod_{\tau=l_1}^{l_2}(D'_{i,\tau}+D''_{i,\tau})W'_\tau - (W_L - W^{(1)}_L)\prod_{\tau=l_1}^{l_2}D_{i,\tau}W_\tau\Big\|_2 \le O\big(\omega^{1/3}L^2\sqrt{m\log m}\big). \tag{D.7}
\]
Combining (D.6) and (D.7) leads to (D.4). For (D.5), we have
\[
\Big\|W_L\prod_{\tau=l_1}^{l_2}(D_{i,\tau}+D''_{i,\tau})W_\tau\Big\|_2
\le \Big\|W_L\prod_{\tau=l_1}^{l_2}(D_{i,\tau}+D''_{i,\tau})W_\tau - W^{(1)}_L\prod_{\tau=l_1}^{l_2}D^{(1)}_{i,\tau}W^{(1)}_\tau\Big\|_2 + \Big\|W^{(1)}_L\prod_{\tau=l_1}^{l_2}D^{(1)}_{i,\tau}W^{(1)}_\tau\Big\|_2,
\]
where the first term is bounded by applying (D.4) and the second by Lemma 7.4 in [3]. The proof is completed.

Lemma D.3. Suppose the derivative of the loss function satisfies $|\mathcal{L}'| \le O(1)$. With probability at least $1 - O(TKL^2)\exp[-\Omega(m\omega^{2/3}L)]$, for all $t \in [T]$, $\|\theta - \theta_1\|_2 \le \omega$ with $\omega \le O(L^{-6}[\log m]^{-3})$, and supposing $\|\mathcal{L}_t(\theta)'\|_2 \le \sqrt{K}$, it holds uniformly that
\[
\|\nabla_\theta f(x_t;\theta)\|_2 \le O(\sqrt{Lm}), \tag{D.8}
\]
\[
\|\nabla_\theta f(x_t;\theta)[k]\|_2 \le O(\sqrt{Lm}), \tag{D.9}
\]
\[
\|\nabla_\theta \mathcal{L}_t(\theta)\|_2 \le O\big(\sqrt{(K+L-1)m}\big). \tag{D.10}
\]
Proof. By Lemma D.1, $\|\nabla_{W_L} f(x_t;\theta)\|_F = \|\sqrt{m}\, g_{t,L-1}\|_2 \le O(\sqrt{m})$. For $l \in [L-1]$,
\[
\|\nabla_{W_l} f(x_t;\theta)\|_F = \sqrt{m}\,\Big\|g_{t,l-1}\, W_L\Big(\prod_{\tau=l+1}^{L-1} D_{t,\tau}W_\tau\Big)D_{t,l}\Big\|_F
\le \sqrt{m}\,\|g_{t,l-1}\|_2\,\Big\|W_L\prod_{\tau=l+1}^{L-1} D_{t,\tau}W_\tau\Big\|_F \le O(\sqrt{m}).
\]
Thus, applying the union bound over $l \in [L]$, $t \in [T]$, $k \in [K]$, it holds uniformly that
\[
\|\nabla_\theta f(x_t;\theta)\|_2 \le \Big(\sum_{l=1}^L \|\nabla_{W_l} f(x_t;\theta)\|_F^2\Big)^{1/2} \le O(\sqrt{Lm}).
\]
Let $e_k$ be the $k$-th basis vector. Then,
\[
\|\nabla_\theta f(x_t;\theta)[k]\|_2 \le \Big(\sum_{l=1}^L \|e_k\|_2^2\,\|\nabla_{W_l} f(x_t;\theta)\|_F^2\Big)^{1/2} \le O(\sqrt{Lm}).
\]
These prove (D.8) and (D.9). For $W_L$, $\|\nabla_{W_L}\mathcal{L}_t(\theta)\|_F = \|\mathcal{L}_t(\theta)'\|_2\,\|\nabla_{W_L} f(x_t;\theta)\|_F \le O(\sqrt{Km})$; for $l \in [L-1]$, $\|\nabla_{W_l}\mathcal{L}_t(\theta)\|_F \le O(\sqrt{m})$. Therefore, $\|\nabla_\theta \mathcal{L}_t(\theta)\|_2 = \big(\sum_{l=1}^L \|\nabla_{W_l}\mathcal{L}_t(\theta)\|_F^2\big)^{1/2} \le O(\sqrt{(K+L-1)m})$, which proves (D.10).

Lemma D.4. With probability at least $1 - O(TL^2K)\exp[-\Omega(m\omega^{2/3}L)]$ over random initialization, for all $t \in [T]$ and $\theta, \theta'$ satisfying $\|\theta - \theta_1\|_2 \le \omega$ and $\|\theta' - \theta_1\|_2 \le \omega$ with $\omega \le O(L^{-6}[\log m]^{-3/2})$, it holds uniformly that
\[
\big|f(x_t;\theta')[k] - f(x_t;\theta)[k] - \langle \nabla_\theta f(x_t;\theta)[k], \theta' - \theta\rangle\big| \le O\big(\omega^{1/3}L^3\sqrt{m\log m}\big)\,\|\theta' - \theta\|_2.
\]
Proof. Let $F(x_t;\theta')[k] = f(x_t;\theta)[k] + \langle \nabla_\theta f(x_t;\theta)[k], \theta' - \theta\rangle$ and let $e_k \in \mathbb{R}^K$ be the $k$-th basis vector. Then,
\[
f(x_t;\theta')[k] - F(x_t;\theta')[k] = \sqrt{m}\sum_{l=1}^{L-1} e_k^\top W_L\Big(\prod_{\tau=l+1}^{L-1} D_{t,\tau}W_\tau\Big)D_{t,l}(W'_l - W_l)\,g_{t,l-1} + \sqrt{m}\, e_k^\top W'_L\big(g'_{t,L-1} - g_{t,L-1}\big).
\]
Using Claim 8.2 in [3], there exist diagonal matrices $D''_{t,l} \in \mathbb{R}^{m\times m}$ with entries in $[-1,1]$ such that $\|D''_{t,l}\|_0 \le O(m\omega^{2/3}L)$ and
\[
g'_{t,L-1} - g_{t,L-1} = \sum_{l=1}^{L-1}\Big(\prod_{\tau=l+1}^{L-1}(D'_{t,\tau}+D''_{t,\tau})W'_\tau\Big)(D'_{t,l}+D''_{t,l})(W_l - W'_l)\,g_{t,l-1}.
\]
Thus,
\[
\big|f(x_t;\theta')[k] - F(x_t;\theta')[k]\big| \le \sum_{l=1}^{L-1}\sqrt{m}\,\Big(\Big\|e_k^\top W'_L\prod_{\tau=l+1}^{L-1}(D'_{t,\tau}+D''_{t,\tau})W'_\tau(D'_{t,l}+D''_{t,l})\Big\|_2 + \Big\|e_k^\top W_L\Big(\prod_{\tau=l+1}^{L-1}D_{t,\tau}W_\tau\Big)D_{t,l}\Big\|_2\Big)\,\|W'_l - W_l\|_2\,\|g_{t,l-1}\|_2 \le O\big(\omega^{1/3}L^3\sqrt{m\log m}\big)\,\|\theta' - \theta\|_2,
\]
where the two operator norms are bounded by applying (D.5) and Lemma D.1, respectively. The proof is completed.
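As a quick numerical illustration of Lemma D.4, the following sketch compares a wide two-layer ReLU network with its first-order expansion around initialization. The width, input dimension, and perturbation radii are arbitrary illustrative choices; the expected behavior is that the linearization error shrinks super-linearly as the perturbation radius decreases.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, d = 4096, 16
net = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))
x = torch.randn(1, d)

theta0 = nn.utils.parameters_to_vector(net.parameters()).detach().clone()
f0 = net(x)                                   # output at initialization
grads = torch.autograd.grad(f0.sum(), list(net.parameters()))
g0 = torch.cat([g.flatten() for g in grads])  # gradient at theta0

for omega in [1e-2, 1e-1, 1.0]:
    v = torch.randn_like(theta0)
    delta = omega * v / v.norm()              # ||delta||_2 = omega
    nn.utils.vector_to_parameters(theta0 + delta, net.parameters())
    with torch.no_grad():
        f_pert = net(x).item()
    f_lin = (f0 + g0 @ delta).item()          # first-order prediction
    print(f"omega={omega:.0e}  linearization error={abs(f_pert - f_lin):.2e}")
nn.utils.vector_to_parameters(theta0, net.parameters())  # restore weights
```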
Define $\mathcal{L}_t(\theta) = \|f(x_t;\theta) - \mathbf{l}_t\|_2$ and $\mathcal{L}_{t,k}(\theta) = |f(x_t;\theta)[k] - \mathbf{l}_t[k]|$, where $\mathbf{l}_t \in \mathbb{R}^K$ represents the corresponding label to train on.

Lemma D.5 (Almost Convexity). With probability at least $1 - O(TL^2K)\exp[-\Omega(m\omega^{2/3}L)]$ over random initialization, for any $\epsilon > 0$, all $t \in [T]$, and $\theta, \theta'$ satisfying $\|\theta - \theta_1\|_2 \le \omega$ and $\|\theta' - \theta_1\|_2 \le \omega$ with $\omega \le O\big(\epsilon^{3/4}L^{-9/4}(Km[\log m])^{-3/8}\big) \le O(L^{-6}[\log m]^{-3/2})$, it holds uniformly that
\[
\mathcal{L}_t(\theta') \ge \mathcal{L}_t(\theta) + \sum_{k=1}^K \langle \nabla_\theta \mathcal{L}_{t,k}(\theta), \theta' - \theta\rangle - \epsilon.
\]
Proof. Let $\mathcal{L}_{t,k}(\theta')$ be the loss with respect to $f(x_t;\theta')[k]$. By convexity of $\mathcal{L}_t$ as a function of the network output, it holds uniformly that
\[
\mathcal{L}_t(\theta') - \mathcal{L}_t(\theta)
\overset{(a)}{\ge} \sum_{k=1}^K \mathcal{L}'_{t,k}(\theta)\big(f(x_t;\theta')[k] - f(x_t;\theta)[k]\big)
\overset{(b)}{\ge} \sum_{k=1}^K \mathcal{L}'_{t,k}(\theta)\,\langle \nabla_\theta f(x_t;\theta)[k], \theta' - \theta\rangle - \sum_{k=1}^K \big|\mathcal{L}'_{t,k}(\theta)\big|\cdot\big|f(x_t;\theta')[k] - f(x_t;\theta)[k] - \langle \nabla_\theta f(x_t;\theta)[k], \theta' - \theta\rangle\big|
\]
\[
\overset{(c)(d)}{\ge} \sum_{k=1}^K \langle \nabla_\theta \mathcal{L}_{t,k}(\theta), \theta' - \theta\rangle - K\cdot O\big(\omega^{4/3}L^3\sqrt{m\log m}\big)
\overset{(e)}{\ge} \sum_{k=1}^K \langle \nabla_\theta \mathcal{L}_{t,k}(\theta), \theta' - \theta\rangle - \epsilon,
\]
where (a) is due to the convexity of $\mathcal{L}_t$, (b) is an application of the triangle inequality, (c) is because of the Cauchy–Schwarz inequality, (d) is the application of Lemma D.4, and (e) is by the choice of $\omega$. The proof is completed.

Lemma D.6 (Loss Bound). With probability at least $1 - O(TL^2K)\exp[-\Omega(m\omega^{2/3}L)]$ over random initialization, and supposing $R, \eta, m$ satisfy the conditions in Theorem 5.2, it holds that
\[
\sum_{t=1}^T \mathcal{L}_t(\theta_t) \le \sum_{t=1}^T \mathcal{L}_t(\theta^*) + O(\sqrt{TK}\,R), \tag{D.11}
\]
where $\theta^* = \arg\inf_{\theta\in\mathcal{B}(\theta_1,R)} \sum_{t=1}^T \mathcal{L}_t(\theta)$.
Proof. Let $\omega = O\big(\epsilon^{3/4}L^{-6}(Km)^{-3/8}[\log m]^{-3/2}\big)$ so that the conditions of Lemma D.5 are satisfied. We first show $\|\theta_t - \theta_1\|_2 \le \omega$ for all $t \in [T]$ by a simple induction. Obviously, $\theta_1 \in \mathcal{B}(\theta_1, R)$. Suppose $\theta_1, \theta_2, \dots, \theta_T \in \mathcal{B}(\theta_1, R)$. Then, for any $t \in [T]$,
\[
\|\theta_{t+1} - \theta_1\|_2 \le \sum_{t=1}^T \|\theta_{t+1} - \theta_t\|_2 \le \sum_{t=1}^T \eta\,\|\nabla\mathcal{L}_t(\theta_t)\|_2 \le T\eta\cdot O\big(\sqrt{(K+L-1)m}\big) \le \omega
\]
when $m \ge \widetilde\Omega(T^4 L^{52} K R^8 \epsilon^{-6})$; this also gives $R/\sqrt{m} \le \omega$. In round $t$, based on Lemma D.5, for any $\|\theta_t - \theta'\|_2 \le \omega$, it holds uniformly that
\[
\mathcal{L}_t(\theta_t) - \mathcal{L}_t(\theta') \le \sum_{k=1}^K \langle \nabla_\theta \mathcal{L}_{t,k}(\theta_t), \theta_t - \theta'\rangle + \epsilon.
\]
Then, it holds uniformly that
\[
\mathcal{L}_t(\theta_t) - \mathcal{L}_t(\theta')
\overset{(a)}{\le} \frac{1}{\eta}\langle \theta_t - \theta_{t+1}, \theta_t - \theta'\rangle + \epsilon
\overset{(b)}{=} \frac{\|\theta_t - \theta'\|_2^2 + \|\theta_t - \theta_{t+1}\|_2^2 - \|\theta_{t+1} - \theta'\|_2^2}{2\eta} + \epsilon
\overset{(c)}{\le} \frac{\|\theta_t - \theta'\|_2^2 - \|\theta_{t+1} - \theta'\|_2^2}{2\eta} + \eta(K+L-1)m + \epsilon,
\]
where (a) is because of the definition of gradient descent, (b) is due to the fact $2\langle A, B\rangle = \|A\|_F^2 + \|B\|_F^2 - \|A - B\|_F^2$, and (c) is by $\|\theta_t - \theta_{t+1}\|_2^2 = \eta^2\|\nabla_\theta \mathcal{L}_t(\theta_t)\|_2^2 \le O(\eta^2(K+L-1)m)$. Then, summing over $T$ rounds, we have
\[
\sum_{t=1}^T \big(\mathcal{L}_t(\theta_t) - \mathcal{L}_t(\theta')\big)
\overset{(a)}{\le} \frac{\|\theta_1 - \theta'\|_2^2}{2\eta} + \sum_{t=1}^T (K+L-1)\eta m + T\epsilon
\le \frac{R^2}{2m\eta} + T(K+L-1)\eta m + T\epsilon
\overset{(b)}{\le} O(\sqrt{TK}\,R),
\]
where (a) telescopes and discards the last (negative) term, and (b) sets $\eta = \frac{R}{m\sqrt{TK}}$, uses $L \le K$, and chooses $\epsilon$ accordingly. The proof is completed.

Lemma D.7. Let $\bar G = [\nabla_{\theta_0} f(x_1;\theta_0)[1], \nabla_{\theta_0} f(x_1;\theta_0)[2], \dots, \nabla_{\theta_0} f(x_T;\theta_0)[K]]/\sqrt{m} \in \mathbb{R}^{p\times TK}$, and let $\mathbf{H}$ be the NTK defined in Definition C.1. Then, for any $0 < \delta \le 1$, suppose $m = \Omega\big(\frac{T^4K^4L^6 \log(T^2K^2L/\delta)}{\lambda_0^4}\big)$. Then, with probability at least $1-\delta$, we have
\[
\|\bar G^\top \bar G - \mathbf{H}\|_F \le \lambda_0/2.
\]
Proof. Using Theorem 3.1 in [4], for any $\epsilon > 0$ and $\delta \in (0,1)$, suppose $m = \Omega(\frac{L^6\log(L/\delta)}{\epsilon^4})$. Then, for any $i, j \in [T]$, $k, k' \in [K]$, with probability at least $1-\delta$,
\[
\big|\langle \nabla_{\theta_0} f(x_i;\theta_0)[k], \nabla_{\theta_0} f(x_j;\theta_0)[k']\rangle/m - \mathbf{H}_{ik,jk'}\big| \le \epsilon.
\]
Then, taking the union bound over $[T]$ and $[K]$ and setting $\epsilon = \frac{\lambda_0}{2TK}$, with probability at least $1-\delta$,
\[
\|\bar G^\top\bar G - \mathbf{H}\|_F^2 = \sum_{i,j=1}^{T}\sum_{k,k'=1}^{K}\big|\langle \nabla_{\theta_0} f(x_i;\theta_0)[k], \nabla_{\theta_0} f(x_j;\theta_0)[k']\rangle/m - \mathbf{H}_{ik,jk'}\big|^2 \le \lambda_0^2/4,
\]
where $m = \Omega\big(\frac{L^6 T^4K^4\log(T^2K^2L/\delta)}{\lambda_0^4}\big)$. The proof is completed.

Lemma D.8. Define $\mathbf{u} = [\ell(y_{1,1}, y_1), \dots, \ell(y_{T,K}, y_T)]^\top \in \mathbb{R}^{TK}$ and $S' = \sqrt{\mathbf{u}^\top \mathbf{H}^{-1}\mathbf{u}}$. With probability at least $1-\delta$, the result holds:
\[
\inf_{\theta\in\mathcal{B}(\theta_0; S')} \sum_{t=1}^T \mathcal{L}_t(\theta) \le \widetilde{O}(\sqrt{TK}\, S').
\]
Proof. Suppose the singular value decomposition of $\bar G$ is $PAQ^\top$, with $P \in \mathbb{R}^{p\times TK}$, $A \in \mathbb{R}^{TK\times TK}$, $Q \in \mathbb{R}^{TK\times TK}$; then $A \succeq 0$. Define $\hat\theta = \theta_0 + PA^{-1}Q^\top \mathbf{u}/\sqrt{m}$. Then,
\[
\sqrt{m}\,\bar G^\top(\hat\theta - \theta_0) = QAP^\top PA^{-1}Q^\top\mathbf{u} = \mathbf{u},
\]
which leads to
\[
\sum_{t=1}^T\sum_{k=1}^K \big|\ell(y_{t,k}, y_t) - \langle \nabla_{\theta_0} f(x_t;\theta_0)[k], \hat\theta - \theta_0\rangle\big| = 0.
\]
Therefore,
\[
\|\hat\theta - \theta_0\|_2^2 = \mathbf{u}^\top QA^{-2}Q^\top\mathbf{u}/m = \mathbf{u}^\top(\bar G^\top\bar G)^{-1}\mathbf{u}/m \le \mathbf{u}^\top \mathbf{H}^{-1}\mathbf{u}/m. \tag{D.12}
\]
Based on Lemma D.4, with $\omega = R/m^{1/2}$ and initializing so that $f(x_t;\theta_0) = 0$, we have
\[
\sum_{t=1}^T \mathcal{L}_t(\hat\theta) \le \sum_{t=1}^T\sum_{k=1}^K \big|y_t[k] - \langle \nabla_{\theta_0} f(x_t;\theta_0)[k], \hat\theta - \theta_0\rangle\big| + TK\cdot O\big(\omega^{1/3}L^3\sqrt{m\log m}\big)\,\|\hat\theta - \theta_0\|_2
\le TK\cdot O\big((R/m^{1/2})^{4/3}L^3\sqrt{m\log m}\big)
\overset{(a)}{\le} \widetilde{O}(\sqrt{TK}\,S'),
\]
where (a) is by the choice of $m \ge \widetilde\Omega(T^3K^3R^2)$ and $R = S'$. Therefore, putting these together, we have
\[
\inf_{\theta\in\mathcal{B}(\theta_0;R)}\sum_{t=1}^T \mathcal{L}_t(\theta) \le \widetilde{O}(\sqrt{TK}\,S'), \qquad \text{where } R = S' = \sqrt{\mathbf{u}^\top\mathbf{H}^{-1}\mathbf{u}}.
\]
The proof is completed.
D.1 MAIN LEMMAS

Lemma D.9. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.3. For any $\delta \in (0,1)$, with probability at least $1-\delta$ over the initialization, it holds uniformly that
\[
\frac{1}{T}\sum_{t=1}^T \mathbb{E}_{y_t\sim \mathcal{D}_{X}|x_t}\Big[\big\|f_1(x_t;\hat\theta^1_t) - \big(u_t - f_2(\phi(x_t);\hat\theta^2_t)\big)\big\|_2 \,\Big|\, H_{t-1}\Big]
\le O\Big(\sqrt{\frac{KS^2}{T}}\Big) + O\Big(\sqrt{\frac{2\log(3T/\delta)}{T}}\Big),
\]
where $u_t = [\ell(y_{t,1}, y_t), \dots, \ell(y_{t,K}, y_t)]^\top$ and $H_{t-1} = \{x_\tau, y_\tau\}_{\tau=1}^{t-1}$ is the historical data.
Proof. This lemma is inspired by Lemma 5.1 in [13]. For any round $t \in [T]$, define
\[
V_t = \mathbb{E}\Big[\big\|f_2(\phi(x_t);\hat\theta^2_t) - \big(u_t - f_1(x_t;\hat\theta^1_t)\big)\big\|_2\Big] - \big\|f_2(\phi(x_t);\hat\theta^2_t) - \big(u_t - f_1(x_t;\hat\theta^1_t)\big)\big\|_2. \tag{D.13}
\]
Then $\mathbb{E}[V_t \mid \mathcal{F}_{t-1}] = 0$, where $\mathcal{F}_{t-1}$ denotes the $\sigma$-algebra generated by the history $H_{t-1}$. Therefore, $\{V_t\}_{t=1}^T$ is a martingale difference sequence. Applying the Hoeffding–Azuma inequality and the union bound, with probability at least $1-\delta$,
\[
\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Big[\big\|f_2(\phi(x_t);\hat\theta^2_t) - \big(u_t - f_1(x_t;\hat\theta^1_t)\big)\big\|_2\Big]
\le \underbrace{\frac{1}{T}\sum_{t=1}^T \big\|f_2(\phi(x_t);\hat\theta^2_t) - \big(u_t - f_1(x_t;\hat\theta^1_t)\big)\big\|_2}_{I_2} + \sqrt{\frac{2\log(1/\delta)}{T}}.
\]
For $I_2$, applying Lemma D.6 and Lemma D.8 to $\theta^2$, we have
\[
I_2 \le \frac{1}{T}\cdot\widetilde{O}(\sqrt{TK}\,S') = \widetilde{O}\Big(\sqrt{\frac{KS'^2}{T}}\Big). \tag{D.17}
\]
Combining the above inequalities and applying the union bound, the stated bound holds with $S'$ in place of $S$; applying the Hoeffding–Azuma inequality once more to $S'$, using $\mathbb{E}[S'] = S$, completes the proof.

Lemma D.10 is a variant of Lemma D.9 stated for a single round.

Lemma D.10. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.3. For any $\delta \in (0,1)$, with probability at least $1-\delta$ over the random initialization, for all $t \in [T]$, it holds uniformly that
\[
\mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\Big[\big\|f_1(x_t;\theta^1_t) - \big(u_t - f_2(\phi(x_t);\theta^2_t)\big)\big\|_2 \,\Big|\, H_{t-1}\Big]
\le O\Big(\sqrt{\frac{KS^2}{t}}\Big) + O\Big(\sqrt{\frac{2\log(3T/\delta)}{t}}\Big) =: \beta_t,
\]
where $u_t = [\ell(y_{t,1}, y_t), \dots, \ell(y_{t,K}, y_t)]^\top$, $H_{t-1} = \{x_\tau, y_\tau\}_{\tau=1}^{t-1}$ is the historical data, and the expectation is also taken over $(\theta^1_t, \theta^2_t)$.
Proof. For any round $\tau \in [t]$, define
\[
V_\tau = \mathbb{E}_{(x_\tau,y_\tau)\sim\mathcal{D}}\Big[\big\|f_2(\phi(x_\tau);\hat\theta^2_\tau) - \big(u_\tau - f_1(x_\tau;\hat\theta^1_\tau)\big)\big\|_2\Big] - \big\|f_2(\phi(x_\tau);\hat\theta^2_\tau) - \big(u_\tau - f_1(x_\tau;\hat\theta^1_\tau)\big)\big\|_2. \tag{D.20}
\]
As in Lemma D.9, $\{V_\tau\}_{\tau=1}^t$ is a martingale difference sequence, so the Hoeffding–Azuma inequality and the union bound give, with probability at least $1-\delta$,
\[
\frac{1}{t}\sum_{\tau=1}^t \mathbb{E}_{(x_\tau,y_\tau)\sim\mathcal{D}}\Big[\big\|f_2(\phi(x_\tau);\hat\theta^2_\tau) - \big(u_\tau - f_1(x_\tau;\hat\theta^1_\tau)\big)\big\|_2\Big]
\le \frac{1}{t}\sum_{\tau=1}^t \big\|f_2(\phi(x_\tau);\hat\theta^2_\tau) - \big(u_\tau - f_1(x_\tau;\hat\theta^1_\tau)\big)\big\|_2 + \sqrt{\frac{2\log(1/\delta)}{t}}.
\]
Based on the definition of $(\theta^1_t, \theta^2_t)$ in Algorithm 1, we have
\[
\mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\,\mathbb{E}_{(\theta^1,\theta^2)}\Big[\big\|f_1(x_t;\theta^1_t) - \big(u_t - f_2(\phi(x_t);\theta^2_t)\big)\big\|_2\Big]
= \frac{1}{t}\sum_{\tau=1}^t \mathbb{E}_{(x_\tau,y_\tau)\sim\mathcal{D}}\Big[\big\|f_1(x_\tau;\hat\theta^1_\tau) - \big(u_\tau - f_2(\phi(x_\tau);\hat\theta^2_\tau)\big)\big\|_2\Big]. \tag{D.24}
\]
For the empirical term, applying Lemma D.6 and Lemma D.8 again yields $\frac{1}{t}\,\widetilde O(\sqrt{tK}\,S') = \widetilde O\big(\sqrt{KS'^2/t}\big)$ (D.26), where the choice of $m$ absorbs lower-order terms. Putting these together, with probability at least $1-\delta$,
\[
\mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\Big[\big\|f_1(x_t;\theta^1_t) + f_2(\phi(x_t);\theta^2_t) - u_t\big\|_2\Big]
\le O\Big(\sqrt{\frac{KS'^2}{t}}\Big) + \sqrt{\frac{2\log(2T/\delta)}{t}},
\]
where the union bound over $\delta$ makes the above events occur concurrently for all $T$ rounds. Applying the Hoeffding–Azuma inequality again to $S'$ completes the proof.
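Lemmas D.9 and D.10 control the prediction error of the combined score $f_1 + f_2$, where the exploration network $f_2$ is trained to fit the residual $u_t - f_1(x_t;\theta^1_t)$. A minimal sketch of one such training step is below; the widths, the stand-in for the end-to-end embedding $\phi(\cdot)$, and the optimizer settings are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn

d, K, m = 32, 10, 256
f1 = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, K))   # exploitation
f2 = nn.Sequential(nn.Linear(K, m), nn.ReLU(), nn.Linear(m, K))   # exploration
opt = torch.optim.SGD(list(f1.parameters()) + list(f2.parameters()), lr=1e-3)

def step(x, u):
    """One SGD step on a queried pair (x, u), where u is the loss vector u_t."""
    out1 = f1(x)
    phi = out1.detach()          # stand-in for the end-to-end embedding phi(x)
    out2 = f2(phi)
    target2 = (u - out1).detach()            # residual the exploration net fits
    loss = ((out1 - u) ** 2).sum() + ((out2 - target2) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    return (out1 + out2).detach()            # combined score f1 + f2

x, u = torch.randn(1, d), torch.rand(1, K)
scores = step(x, u)
```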
Lemma D.11. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.3. For any $\delta \in (0,1)$ and $\gamma > 1$, with probability at least $1-\delta$ over the random initialization, for all $t \in [T]$, when $I_t = 0$ it holds uniformly that
\[
\mathbb{E}_{x_t\sim\mathcal{D}_X}\big[h(x_t)[\hat k]\big] = \mathbb{E}_{x_t\sim\mathcal{D}_X}\big[h(x_t)[k^*]\big].
\]
Proof. Throughout, $\hat k = \arg\max_{k\in[K]} f(x_t;\theta_t)[k]$, $k^\circ = \arg\max_{k\in[K]\setminus\{\hat k\}} f(x_t;\theta_t)[k]$, and $k^* = \arg\max_{k\in[K]} h(x_t)[k]$. As $I_t = 0$, we have
\[
\big|f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ]\big| = f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ] \ge 2\gamma\beta_t.
\]
In round $t$, based on Lemma D.10, with probability at least $1-\delta$ the following event happens:
\[
\widehat{\mathcal{E}}_0 = \Big\{\forall \tau\in[t],\ k\in[K]:\ \mathbb{E}_{x_\tau\sim\mathcal{D}_X}\big[|f(x_\tau;\theta_\tau)[k] - h(x_\tau)[k]|\big] \le \beta_\tau\Big\}. \tag{D.28}
\]
When $\widehat{\mathcal{E}}_0$ happens, with probability at least $1-\delta$, we have
\[
\mathbb{E}\big[f(x_t;\theta_t)[\hat k]\big] - \beta_t \le \mathbb{E}\big[h(x_t)[\hat k]\big] \le \mathbb{E}\big[f(x_t;\theta_t)[\hat k]\big] + \beta_t,
\qquad
\mathbb{E}\big[f(x_t;\theta_t)[k^\circ]\big] - \beta_t \le \mathbb{E}\big[h(x_t)[k^\circ]\big] \le \mathbb{E}\big[f(x_t;\theta_t)[k^\circ]\big] + \beta_t. \tag{D.29}
\]
Then, with probability at least $1-\delta$,
\[
\mathbb{E}\big[h(x_t)[\hat k] - h(x_t)[k^\circ]\big] \ge \mathbb{E}\big[f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ]\big] - 2\beta_t \ge 2\gamma\beta_t - 2\beta_t > 0,
\]
where the last inequality is because $\gamma > 1$. Similarly, for any $k \in ([K]\setminus\{\hat k, k^\circ\})$, we have $\mathbb{E}\big[h(x_t)[\hat k] - h(x_t)[k]\big] \ge 0$. Thus, based on the definition of $h(x_t)[k^*]$, we have $\mathbb{E}[h(x_t)[\hat k]] = \mathbb{E}[h(x_t)[k^*]]$. The proof is completed.

Lemma D.12. When $t \ge \bar T = \widetilde O\big(\frac{\gamma^2 KS^2}{\epsilon^2}\big)$, it holds that $2(\gamma+1)\beta_t \le \epsilon$.
Proof. To achieve $2(\gamma+1)\beta_t \le \epsilon$, i.e.,
\[
\sqrt{\frac{C_1\, KS^2}{t}} + \sqrt{\frac{C_2\, 2\log(3T/\delta)}{t}} \le \frac{\epsilon}{2(\gamma+1)}
\]
for absolute constants $C_1, C_2$, it suffices to take
\[
t \ge \frac{8(\gamma+1)^2\big(C_1 KS^2 + 2C_2\log(3T/\delta)\big)}{\epsilon^2}.
\]
The proof is completed.
Lemma D.13. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.3. Under Assumption 5.1, for any $\delta \in (0,1)$ and $\gamma > 1$, with probability at least $1-\delta$ over the random initialization, when $t \ge \bar T$ it holds uniformly that
\[
\mathbb{E}_{x_t\sim\mathcal{D}_X}[I_t] = 0, \qquad \mathbb{E}_{x_t\sim\mathcal{D}_X}\big[h(x_t)[k^*]\big] = \mathbb{E}_{x_t\sim\mathcal{D}_X}\big[h(x_t)[\hat k]\big].
\]
Proof. Define the events
\[
\mathcal{E}_1 = \big\{\forall t\ge\bar T:\ \mathbb{E}[h(x_t)[k^*] - h(x_t)[\hat k]] = 0\big\},\qquad
\mathcal{E}_2 = \big\{\forall t\ge\bar T:\ \mathbb{E}[f(x_t;\theta_t)[k^*] - f(x_t;\theta_t)[\hat k]] = 0\big\},
\]
\[
\widehat{\mathcal{E}}_1 = \big\{\forall t\ge\bar T:\ \mathbb{E}[f(x_t;\theta_t)[k^*] - f(x_t;\theta_t)[k^\circ]] < 2\gamma\beta_t\big\}.
\]
We first show that $\widehat{\mathcal{E}}_1$ does not happen. When $\widehat{\mathcal{E}}_0$ of Eq. (D.28) happens, with probability at least $1-\delta$, the sandwich bounds of (D.29) hold with $k^*$ and $k^\circ$ in place of $\hat k$ and $k^\circ$ (D.32). Therefore,
\[
\mathbb{E}\big[h(x_t)[k^*] - h(x_t)[k^\circ]\big] \le \mathbb{E}\big[f(x_t;\theta_t)[k^*] - f(x_t;\theta_t)[k^\circ]\big] + 2\beta_t.
\]
Suppose $\widehat{\mathcal{E}}_1$ happens; then $\mathbb{E}[h(x_t)[k^*] - h(x_t)[k^\circ]] \le 2(\gamma+1)\beta_t$. Based on Lemma D.12, when $t \ge \bar T$, $2(\gamma+1)\beta_t \le \epsilon$, and therefore $\mathbb{E}[h(x_t)[k^*] - h(x_t)[k^\circ]] \le \epsilon$. This contradicts Assumption 5.1, i.e., $h(x_t)[k^*] - h(x_t)[k^\circ] \ge \epsilon$. Hence, $\widehat{\mathcal{E}}_1$ does not happen. Accordingly, with probability at least $1-\delta$, the following event happens:
\[
\widehat{\mathcal{E}}_2 = \big\{\forall t\ge\bar T:\ \mathbb{E}[f(x_t;\theta_t)[k^*] - f(x_t;\theta_t)[k^\circ]] \ge 2\gamma\beta_t\big\}.
\]
Therefore, $\mathbb{E}[f(x_t;\theta_t)[k^*]] > \mathbb{E}[f(x_t;\theta_t)[k^\circ]]$. Recall that $k^* = \arg\max_{k\in[K]} h(x_t)[k]$ and $\hat k = \arg\max_{k\in[K]} f(x_t;\theta_t)[k]$. Since $f(x_t;\theta_t)[k] \le f(x_t;\theta_t)[k^\circ]$ for every $k \in ([K]\setminus\{\hat k\})$, we obtain $\mathbb{E}[f(x_t;\theta_t)[k^*]] > \mathbb{E}[f(x_t;\theta_t)[k]]$ for all $k \in ([K]\setminus\{\hat k\})$. Based on the definition of $\hat k$, we have
\[
\mathbb{E}\big[f(x_t;\theta_t)[k^*]\big] = \mathbb{E}\big[f(x_t;\theta_t)[\hat k]\big] = \mathbb{E}\big[\max_{k\in[K]} f(x_t;\theta_t)[k]\big]. \tag{D.34}
\]
This indicates that $\mathcal{E}_2$ happens with probability at least $1-\delta$. Based on $\widehat{\mathcal{E}}_2$ and (D.34), the following inferred event happens with probability at least $1-\delta$:
\[
\widehat{\mathcal{E}}_3 = \big\{\forall t\ge\bar T:\ \mathbb{E}[f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ]] \ge 2\gamma\beta_t\big\},
\]
which, by the query rule (Lines 8-9 of Algorithm 1), gives $\mathbb{E}_{x_t\sim\mathcal{D}_X}[I_t] = 0$ for $t \ge \bar T$. Then, based on Eq. (D.32), we have
\[
\mathbb{E}\big[h(x_t)[\hat k] - h(x_t)[k^\circ]\big] \ge \mathbb{E}\big[f(x_t;\theta_t)[\hat k]\big] - \beta_t - \big(\mathbb{E}[f(x_t;\theta_t)[k^\circ]] + \beta_t\big) = \mathbb{E}\big[f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ]\big] - 2\beta_t \ge 2(\gamma-1)\beta_t > 0,
\]
where the last step uses that $\widehat{\mathcal{E}}_3$ happened with probability at least $1-\delta$. Therefore, $\mathbb{E}[h(x_t)[\hat k]] - \mathbb{E}[h(x_t)[k^\circ]] > 0$. Similarly, we can prove that for every $k \in ([K]\setminus\{\hat k\})$, $\mathbb{E}[h(x_t)[\hat k]] - \mathbb{E}[h(x_t)[k]] > 0$. Then, based on the definition of $k^*$, we have
\[
\mathbb{E}\big[h(x_t)[\hat k]\big] = \mathbb{E}\big[h(x_t)[k^*]\big] = \mathbb{E}\big[\max_{k\in[K]} h(x_t)[k]\big].
\]
Thus, the event $\mathcal{E}_1$ happens with probability at least $1-\delta$. The proof is completed.

D.2 LABEL COMPLEXITY

Lemma D.14 (Label Complexity Analysis). For any $\delta \in (0,1)$, $\gamma \ge 1$, suppose $m$ satisfies the conditions in Theorem 5.2. Then, with probability at least $1-\delta$, we have
\[
\mathbf{N}_T \le \bar T. \tag{D.36}
\]
Proof. Recall that $\hat k = \arg\max_{k\in[K]} f(x_t;\theta_t)[k]$ and $k^\circ = \arg\max_{k\in[K]\setminus\{\hat k\}} f(x_t;\theta_t)[k]$. With probability at least $1-\delta$, the event $\widehat{\mathcal{E}}_0$ of Eq. (D.28) happens. Therefore, applying the sandwich bounds of (D.29) to $\hat k$ and $k^\circ$,
\[
\mathbb{E}\big[h(x_t)[\hat k]\big] - \mathbb{E}\big[h(x_t)[k^\circ]\big] - 2\beta_t
\le \mathbb{E}\big[f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ]\big]
\le \mathbb{E}\big[h(x_t)[\hat k]\big] - \mathbb{E}\big[h(x_t)[k^\circ]\big] + 2\beta_t. \tag{D.37}
\]
Let $\epsilon_t = \big|\mathbb{E}[h(x_t)[\hat k]] - \mathbb{E}[h(x_t)[k^\circ]]\big|$. Based on Lemma D.12 and Lemma D.13, when $t \ge \bar T$ we have
\[
2(\gamma+1)\beta_t \le \epsilon \le \epsilon_t \le 1. \tag{D.38}
\]
For any round $t \le \bar T$, we trivially have $\mathbb{E}_{x_t\sim\mathcal{D}_X}[I_t] \le 1$. For a round $t > \bar T$, based on Lemma D.13 it holds uniformly that $\mathbb{E}[h(x_t)[\hat k]] - \mathbb{E}[h(x_t)[k^\circ]] = \epsilon_t$. Then, based on (D.37),
\[
\mathbb{E}\big[f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ]\big] \ge \epsilon_t - 2\beta_t \ge 2\gamma\beta_t, \tag{D.39}
\]
where the last step uses Eq. (D.38). Thus, $f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ] \ge 2\gamma\beta_t$ for $t > \bar T$, and hence $\mathbb{E}_{x_t\sim\mathcal{D}_X}[I_t] = 0$. Then, assuming $T > \bar T$, we have
\[
\mathbf{N}_T = \sum_{t=1}^T \mathbb{E}_{x_t\sim\mathcal{D}_X}\Big[\mathbb{1}\big\{f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ] < 2\gamma\beta_t\big\}\Big]
\le \bar T + \sum_{t=\bar T+1}^T \mathbb{E}_{x_t\sim\mathcal{D}_X}\Big[\mathbb{1}\big\{f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ] < 2\gamma\beta_t\big\}\Big] = \bar T.
\]
Therefore, $\mathbf{N}_T \le \bar T$. The proof is completed.
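The indicator inside Lemma D.14 is exactly the query rule of Algorithm 1 (Lines 8-9). As a sketch, with $\beta_t$ written in the order-level form suggested by Lemma D.10 (absolute constants omitted):

```python
import numpy as np

def beta(t, K, S, T, delta):
    """Order-level confidence width from Lemma D.10 (constants omitted)."""
    return np.sqrt(K * S**2 / t) + np.sqrt(2 * np.log(3 * T / delta) / t)

def should_query(scores, t, gamma=2.0, K=10, S=1.0, T=10_000, delta=0.1):
    """Query iff the top-two predicted scores are within 2 * gamma * beta_t."""
    top2 = np.sort(scores)[-2:]
    return (top2[1] - top2[0]) < 2.0 * gamma * beta(t, K, S, T, delta)

# Once t exceeds T_bar = O(gamma^2 * K * S^2 / eps^2), beta_t is small enough
# that well-separated classes (Assumption 5.1) no longer trigger a query.
```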
Theorem 5.2 (Stream-based). Given $T$, for any $\delta \in (0,1)$, $\lambda_0 > 0$, suppose $\|x_t\|_2 = 1$ for all $t\in[T]$, $\mathbf{H} \succeq \lambda_0\mathbf{I}$, $m \ge \widetilde\Omega(\mathrm{poly}(T, K, L, S)\cdot\log(1/\delta))$, and $\eta_1 = \eta_2 = \Theta\big(\frac{S}{m\sqrt{TK}}\big)$. Then, with probability at least $1-\delta$ over the initialization of $\theta^1_1, \theta^2_1$, Algorithm 1 achieves the following regret bound:
\[
\mathbf{R}_{stream}(T) \le O\big(\sqrt{TK}\,S\big) + O\big(\sqrt{2T\log(3T/\delta)}\big),
\]
where $\mathbf{N}(T) \le O(T)$.
Proof. Define $R_t = \mathbb{E}_{x_t\sim\mathcal{D}_X}\big[h(x_t)[\hat k] - h(x_t)[k^*]\big]$. Then,
\[
\mathbf{R}_{stream}(T) = \sum_{t=1}^T R_t\,(I_t = 1 \vee I_t = 0)
\le \sum_{t=1}^T \max\big\{R_t \mid I_t = 1,\ R_t \mid I_t = 0\big\}
\overset{(a)}{\le} \sum_{t=1}^T \mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\big[\|f(x_t;\theta_t) - u_t\|_2\big]
\le O\big(\sqrt{TK}\,S\big) + O\big(\sqrt{2T\log(3T/\delta)}\big),
\]
where (a) is based on Lemma D.11 ($R_t \mid (I_t = 0) = 0$), and the last inequality applies Lemma D.10 and sums $\beta_t$ over $t \in [T]$. The proof is completed.

Theorem 5.3 (Stream-based). Given $T$, for any $\delta \in (0,1)$, $\gamma > 1$, $\lambda_0 > 0$, suppose $\|x_t\|_2 = 1$ for all $t\in[T]$, $\mathbf{H} \succeq \lambda_0\mathbf{I}$, $m \ge \widetilde\Omega(\mathrm{poly}(T, K, L, S)\cdot\log(1/\delta))$, $\eta_1 = \eta_2 = \Theta\big(\frac{S}{m\sqrt{TK}}\big)$, and Assumption 5.1 holds. Then, with probability at least $1-\delta$ over the initialization of $\theta^1_1, \theta^2_1$, Algorithm 1 achieves the following regret bound and label complexity:
\[
\mathbf{R}_{stream}(T) \le O\big((KS^2 + \log(3T/\delta))/\epsilon\big), \qquad \mathbf{N}(T) \le O\big((KS^2 + \log(3T/\delta))/\epsilon^2\big). \tag{5.2}
\]
Proof. Given $\bar T$, we divide the rounds into the following two parts:
\[
\mathbf{R}_{stream}(T) = \sum_{t=1}^T R_t\,(I_t = 1 \vee I_t = 0)
\le \underbrace{\sum_{t=1}^{\bar T} R_t \mid (I_t = 1)}_{I_1} + \underbrace{\sum_{t=\bar T + 1}^{T} R_t \mid (I_t = 0)}_{I_2}.
\]
For $I_1$, it holds that
\[
R_t \mid (I_t = 1) = \mathbb{E}\big[h(x_t)[\hat k] - h(x_t)[k^*]\big]
= \mathbb{E}\big[h(x_t)[\hat k] - f(x_t;\theta_t)[\hat k] + f(x_t;\theta_t)[\hat k] - h(x_t)[k^*]\big]
\overset{(a)}{\le} \mathbb{E}\big[h(x_t)[\hat k] - f(x_t;\theta_t)[\hat k] + f(x_t;\theta_t)[k^*] - h(x_t)[k^*]\big]
\]
\[
\le \mathbb{E}\big[|h(x_t)[\hat k] - f(x_t;\theta_t)[\hat k]|\big] + \mathbb{E}\big[|f(x_t;\theta_t)[k^*] - h(x_t)[k^*]|\big]
\le 2\,\mathbb{E}\big[\|h(x_t) - f(x_t;\theta_t)\|_\infty\big]
\overset{(b)}{\le} 2\beta_t,
\]
where (a) is due to the selection criterion of NEURONAL ($f(x_t;\theta_t)[\hat k] \ge f(x_t;\theta_t)[k^*]$) and (b) is an application of Lemma D.10. Then, treating $\gamma$ as a constant,
\[
I_1 \le \sum_{t=1}^{\bar T} 2\beta_t \le O\big(\sqrt{\bar T KS^2}\big) + O\big(\sqrt{2\bar T\log(3T/\delta)}\big) \le O\big((KS^2 + \log(3T/\delta))/\epsilon\big).
\]
For $I_2$, we have $R_t \mid (I_t = 0) = \mathbb{E}[h(x_t)[\hat k] - h(x_t)[k^*]] = 0$ based on Lemma D.13. Therefore,
\[
\mathbf{R}_{stream}(T) \le O\big((KS^2 + \log(3T/\delta))/\epsilon\big).
\]
The proof is completed.

E ANALYSIS IN BINARY CLASSIFICATION

First, we provide the noise condition used in [58].

Assumption E.1 (Mammen–Tsybakov low noise [43; 58]). There exist absolute constants $c > 0$ and $\alpha \ge 0$ such that for all $0 < \epsilon < 1/2$, $x \in \mathcal{X}$, $k \in \{0,1\}$,
\[
\mathbb{P}\big(|h(x)[k] - \tfrac{1}{2}| \le \epsilon\big) \le c\,\epsilon^\alpha.
\]
Then, we provide the following theorem.

Theorem 5.1 (Binary Classification). Given $T$, for any $\delta \in (0,1)$, $\lambda_0 > 0$, suppose $K = 2$, $\|x_t\|_2 = 1$ for all $t\in[T]$, $\mathbf{H} \succeq \lambda_0\mathbf{I}$, $m \ge \widetilde\Omega(\mathrm{poly}(T, L, S)\cdot\log(1/\delta))$, $\eta_1 = \eta_2 = \Theta\big(\frac{S}{m\sqrt{2T}}\big)$. Then, with probability at least $1-\delta$ over the initialization of $\theta^1_1, \theta^2_1$, Algorithm 1 achieves the following regret bound and label complexity:
\[
\mathbf{R}_{stream}(T) \le \widetilde O\big((S^2)^{\frac{\alpha+1}{\alpha+2}}\, T^{\frac{1}{\alpha+2}}\big), \qquad
\mathbf{N}(T) \le \widetilde O\big((S^2)^{\frac{\alpha}{\alpha+2}}\, T^{\frac{2}{\alpha+2}}\big).
\]
In comparison, with Assumption E.1, [58] achieves the following results:
\[
\mathbf{R}_{stream}(T) \le \widetilde O\big(L_{\mathbf{H}}^{\frac{1}{\alpha+2}}\, T^{\frac{1}{\alpha+2}}\big) + \widetilde O\big(L_{\mathbf{H}}^{\frac{\alpha+1}{\alpha+2}}(S^2)^{\frac{\alpha+1}{\alpha+2}}\, T^{\frac{1}{\alpha+2}}\big),
\]
\[
\mathbf{N}(T) \le \widetilde O\big(L_{\mathbf{H}}^{\frac{2\alpha}{\alpha+2}}\, T^{\frac{2}{\alpha+2}}\big) + \widetilde O\big(L_{\mathbf{H}}^{\frac{\alpha}{\alpha+2}}(S^2)^{\frac{\alpha}{\alpha+2}}\, T^{\frac{2}{\alpha+2}}\big).
\]
For the regret $\mathbf{R}_{stream}(T)$, compared to [58], Theorem 5.1 removes the first term above and further improves the bound by a multiplicative factor $L_{\mathbf{H}}^{\frac{\alpha+1}{\alpha+2}}$. For the label complexity $\mathbf{N}(T)$, compared to [58], Theorem 5.1 removes the term $\widetilde O\big(L_{\mathbf{H}}^{\frac{2\alpha}{\alpha+2}} T^{\frac{2}{\alpha+2}}\big)$ and further improves the bound by a multiplicative factor $L_{\mathbf{H}}^{\frac{\alpha}{\alpha+2}}$. It is noteworthy that $L_{\mathbf{H}}$ grows linearly with respect to $T$, i.e., $L_{\mathbf{H}} \ge T\log(1+\lambda_0)$.

Proof. First, define
\[
\Delta_t = h(x_t)[\hat k] - h(x_t)[k^\circ], \qquad
\widehat\Delta_t = f(x_t;\theta_t)[\hat k] - f(x_t;\theta_t)[k^\circ] + 2\beta_t, \qquad
T_\epsilon = \sum_{t=1}^T \mathbb{E}_{x_t\sim\mathcal{D}_X}\big[\mathbb{1}\{\Delta_t^2 \le \epsilon^2\}\big].
\]

Lemma E.1. Suppose the conditions of Theorem 5.1 are satisfied. With probability at least $1-\delta$, the following result holds:
\[
\mathbf{N}_T \le O\Big(T_\epsilon + \frac{1}{\epsilon^2}\big(S^2 + \log(3T/\delta)\big)\Big).
\]
Proof. First, based on Lemma D.11, with probability at least $1-\delta$, we have $0 \le \widehat\Delta_t - \Delta_t \le 4\beta_t$; in particular, $\Delta_t \le \widehat\Delta_t$. Since a query is triggered only when $\widehat\Delta_t \le 4\beta_t$, we have
\[
I_t = I_t\,\mathbb{1}\{\widehat\Delta_t \le 4\beta_t\}
\le I_t\,\mathbb{1}\{\widehat\Delta_t \le 4\beta_t,\ 4\beta_t \ge \epsilon\} + I_t\,\mathbb{1}\{\widehat\Delta_t \le 4\beta_t,\ 4\beta_t < \epsilon\}
\le \frac{16\, I_t\,\beta_t^2}{\epsilon^2} + \mathbb{1}\{\Delta_t^2 \le \epsilon^2\}.
\]
Thus,
\[
\mathbf{N}_T = \sum_{t=1}^T \mathbb{E}_{x_t\sim\mathcal{D}_X}[I_t] \le \sum_{t=1}^T \frac{16\beta_t^2}{\epsilon^2} + T_\epsilon
\le O\Big(\frac{1}{\epsilon^2}\big(2S^2 + 2\log(3T/\delta)\big)\Big) + T_\epsilon,
\]
since $\sum_t \beta_t^2 \le \widetilde O(S^2 + \log(3T/\delta))$ for $K = 2$. The proof is completed.

Lemma E.2. Suppose the conditions of Theorem 5.1 are satisfied. With probability at least $1-\delta$, the following result holds:
\[
\mathbf{R}_T \le O\Big(\epsilon\, T_\epsilon + \frac{1}{\epsilon}\big(S^2 + \log(3T/\delta)\big)\Big).
\]
Proof.
\[
\mathbf{R}_T = \sum_{t=1}^T \mathbb{E}\big[h(x_t)[\hat k] - h(x_t)[k^*]\big] \le \sum_{t=1}^T \mathbb{E}\big[|\Delta_t|\big]
= \sum_{t=1}^T \mathbb{E}\big[|\Delta_t|\,\mathbb{1}\{|\Delta_t| > \epsilon\}\big] + \sum_{t=1}^T \mathbb{E}\big[|\Delta_t|\,\mathbb{1}\{|\Delta_t| \le \epsilon\}\big],
\]
where the second term is upper bounded by $\epsilon\, T_\epsilon$. For the first term,
\[
\sum_{t=1}^T \mathbb{E}\big[|\Delta_t|\,\mathbb{1}\{|\Delta_t| > \epsilon\}\big]
\le \sum_{t=1}^T \frac{\mathbb{E}[|\Delta_t|^2]}{\epsilon}
= \frac{1}{\epsilon}\sum_{t=1}^T \mathbb{E}\big[|\Delta_t|^2\,\mathbb{1}\{\Delta_t \le 2\beta_t\}\big] + \frac{1}{\epsilon}\sum_{t=1}^T \mathbb{E}\big[|\Delta_t|^2\,\mathbb{1}\{\Delta_t > 2\beta_t\}\big]
\le \frac{1}{\epsilon}\sum_{t=1}^T 4\beta_t^2
\le O\Big(\frac{1}{\epsilon}\big(2S^2 + 2\log(3T/\delta)\big)\Big),
\]
where the second term vanishes based on Lemma D.13. The proof is completed.

Then, because $x_1, \dots, x_T$ are generated i.i.d. and Assumption E.1 holds, applying Lemma 23 in [58], with probability at least $1-\delta$, the result holds:
\[
T_\epsilon \le 3T\epsilon^\alpha + O(\log\log T).
\]
Using this bound on $T_\epsilon$ in both Lemma E.1 and Lemma E.2, and then optimizing over $\epsilon$ in the two bounds separately, completes the proof:
\[
\mathbf{R}_T \le O\big((S^2 + \log(3T/\delta))^{\frac{\alpha+1}{\alpha+2}}\, T^{\frac{1}{\alpha+2}}\big), \qquad
\mathbf{N}_T \le O\big((S^2 + \log(3T/\delta))^{\frac{\alpha}{\alpha+2}}\, T^{\frac{2}{\alpha+2}}\big).
\]
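To see where these final rates come from, here is the optimization over $\epsilon$ spelled out, suppressing logarithmic factors and using $T_\epsilon \lesssim T\epsilon^\alpha$:
\[
\mathbf{N}_T \;\lesssim\; T\epsilon^{\alpha} + \frac{S^2}{\epsilon^{2}}
\;\overset{\epsilon = (S^2/T)^{\frac{1}{\alpha+2}}}{=}\;
O\big((S^2)^{\frac{\alpha}{\alpha+2}}\, T^{\frac{2}{\alpha+2}}\big),
\qquad
\mathbf{R}_T \;\lesssim\; \epsilon\, T\epsilon^{\alpha} + \frac{S^2}{\epsilon}
\;\overset{\epsilon = (S^2/T)^{\frac{1}{\alpha+2}}}{=}\;
O\big((S^2)^{\frac{\alpha+1}{\alpha+2}}\, T^{\frac{1}{\alpha+2}}\big),
\]
since in each case the two terms are equal at $\epsilon = (S^2/T)^{1/(\alpha+2)}$.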
F ANALYSIS FOR POOL-BASED SETTING

Define $\mathcal{L}_t(\theta_t) = \|f(x_t;\theta_t) - u_t\|_2^2/2$, $\mathcal{L}_{t,k}(\theta_t) = (f(x_t;\theta_t)[k] - u_t[k])^2/2$, and $u_t = [\ell(y_{t,1}, y_t), \dots, \ell(y_{t,K}, y_t)]^\top$.

Lemma F.1 (Almost Convexity). With probability at least $1 - O(TL^2K)\exp[-\Omega(m\omega^{2/3}L)]$ over random initialization, for all $t \in [T]$ and $\theta, \theta'$ satisfying $\|\theta - \theta_1\|_2 \le \omega$ and $\|\theta' - \theta_1\|_2 \le \omega$ with $\omega \le O\big((Km\log m)^{-3/8}L^{-9/4}\epsilon^{3/4}\big)$, it holds uniformly that
\[
\mathcal{L}_t(\theta') \ge \mathcal{L}_t(\theta)/2 + \sum_{k=1}^K \langle \nabla_\theta\mathcal{L}_{t,k}(\theta), \theta' - \theta\rangle - \epsilon.
\]
Proof. Let $\mathcal{L}_{t,k}(\theta')$ be the loss with respect to $f(x_t;\theta')[k]$. By convexity of $\mathcal{L}_t$ in the network output, it holds uniformly that
\[
\|f(x_t;\theta') - u_t\|_2^2/2 - \|f(x_t;\theta) - u_t\|_2^2/2
\overset{(a)}{\ge} \sum_{k=1}^K \mathcal{L}'_{t,k}(\theta)\big(f(x_t;\theta')[k] - f(x_t;\theta)[k]\big)
\]
\[
\overset{(b)}{\ge} \sum_{k=1}^K \langle\nabla_\theta\mathcal{L}_{t,k}(\theta), \theta' - \theta\rangle - \sum_{k=1}^K \big|f(x_t;\theta)[k] - u_t[k]\big|\cdot\big|f(x_t;\theta')[k] - f(x_t;\theta)[k] - \langle\nabla_\theta f(x_t;\theta)[k], \theta' - \theta\rangle\big|
\]
\[
\overset{(c)(d)}{\ge} \sum_{k=1}^K \langle\nabla_\theta\mathcal{L}_{t,k}(\theta), \theta' - \theta\rangle - O\big(\omega^{4/3}L^3\sqrt{m\log m}\big)\sum_{k=1}^K \big|f(x_t;\theta)[k] - u_t[k]\big|
\]
\[
\overset{(e)}{\ge} \sum_{k=1}^K \langle\nabla_\theta\mathcal{L}_{t,k}(\theta), \theta' - \theta\rangle - O\big(\omega^{4/3}L^3\sqrt{m\log m}\big)\Big(\sum_{k=1}^K \big|f(x_t;\theta)[k] - u_t[k]\big|^2 + \frac{K}{4}\Big),
\]
where (a) is due to the convexity of $\mathcal{L}_t$, (b) is an application of the triangle inequality, (c) is because of the Cauchy–Schwarz inequality, (d) is the application of Lemma D.4 and Lemma D.3, and (e) is by the fact $x \le x^2 + \frac{1}{4}$. Choosing $\omega \le O\big(4^{-3/4}K^{-3/8}(m\log m)^{-3/8}L^{-9/4}\big)$ makes the coefficient of the quadratic term at most $\frac{1}{4}$, and further taking $\omega \le O\big(4^{-3/4}(Km\log m)^{-3/8}L^{-9/4}\epsilon^{3/4}\big)$ bounds the remaining additive term by $\epsilon$, so that
\[
\|f(x_t;\theta') - u_t\|_2^2/2 \ge \|f(x_t;\theta) - u_t\|_2^2/4 + \sum_{k=1}^K \langle\nabla_\theta\mathcal{L}_{t,k}(\theta), \theta' - \theta\rangle - \epsilon.
\]
The proof is completed.

Lemma F.2. With probability at least $1 - O(TL^2K)\exp[-\Omega(m\omega^{2/3}L)]$ over random initialization, for all $t \in [T]$ and $\|\theta' - \theta_1\|_2 \le \omega$ with $\omega \le O(L^{-9/2}[\log m]^{-3})$, the results hold:
\[
\|f(x_t;\theta')\|_\infty \le O(\sqrt{\log m}).
\]
Proof. Using Lemma 7.1 in [3], we have $\|g_{t,L-1}\|_2 \le O(1)$. Then, based on the randomness of $W_L$ (by an argument analogous to Lemma 7.1 in [3]), $\|f(x_t;\theta_1)\|_\infty \le O(\sqrt{\log m})$. Using Lemma 8.2 in [3], we have $\|g'_{t,L-1}\|_2 \le O(1)$, and thus $\|f(x_t;\theta')\|_\infty \le O(\sqrt{\log m})$.

Lemma F.3 (Loss Bound). With probability at least $1 - O(TL^2K)\exp[-\Omega(m\omega^{2/3}L)]$ over random initialization, it holds that
\[
\sum_{t=1}^T \|f(x_t;\theta_t) - u_t\|_2^2/8 \le \inf_{\theta'\in\mathcal{B}(\theta_1;R)} \sum_{t=1}^T \|f(x_t;\theta') - u_t\|_2^2/2 + 5LKR^2. \tag{F.1}
\]
Proof. Let $\omega = O\big((Km\log m)^{-3/8}L^{-9/4}\epsilon^{3/4}\big)$ so that the conditions of Lemma F.1 are satisfied. The proof follows a simple induction. Obviously, $\theta_1 \in \mathcal{B}(\theta_1, R)$. Suppose $\theta_1, \theta_2, \dots, \theta_T \in \mathcal{B}(\theta_1, R)$. Then, for any $t \in [T]$,
\[
\|\theta_{t+1} - \theta_1\|_2 \le \sum_{t=1}^T \|\theta_{t+1} - \theta_t\|_2 \le \sum_{t=1}^T \eta\,\|\nabla\mathcal{L}_t(\theta_t)\|_2
\overset{(a)}{\le} \sum_{t=1}^T \eta\,\|f(x_t;\theta_t) - u_t\|_2\cdot O(\sqrt{KLm}) \le \omega,
\]
where (a) applies Lemma F.2 together with Lemma D.3, and the last inequality holds when $m \ge \widetilde\Omega(T^8 K^{-1} L^{14}\epsilon^{-6})$. Moreover, when $m \ge \widetilde\Omega(R^8 T^8 K^3 L^{18}\epsilon^{-6})$, it also holds that $R/\sqrt{m} \le \omega$. In round $t$, based on Lemma F.1, for any $\|\theta_t - \theta'\|_2 \le 2\omega$, it holds uniformly that
\[
\|f(x_t;\theta_t) - u_t\|_2^2/4 - \|f(x_t;\theta') - u_t\|_2^2/2 \le \sum_{k=1}^K \langle\nabla_\theta\mathcal{L}_{t,k}(\theta_t), \theta_t - \theta'\rangle + \epsilon.
\]
Then, it holds uniformly that
\[
\|f(x_t;\theta_t) - u_t\|_2^2/4 - \|f(x_t;\theta') - u_t\|_2^2/2
\overset{(a)}{\le} \frac{1}{\eta}\langle\theta_t - \theta_{t+1}, \theta_t - \theta'\rangle + \epsilon
\overset{(b)}{=} \frac{\|\theta_t - \theta'\|_2^2 + \|\theta_t - \theta_{t+1}\|_2^2 - \|\theta_{t+1} - \theta'\|_2^2}{2\eta} + \epsilon
\]
\[
\overset{(c)}{\le} \frac{\|\theta_t - \theta'\|_2^2 - \|\theta_{t+1} - \theta'\|_2^2}{2\eta} + \eta\sum_{k=1}^K \big\|\nabla_{\theta_t}\mathcal{L}_{t,k}(\theta_t)\big\|_F^2 + \epsilon
\le \frac{\|\theta_t - \theta'\|_2^2 - \|\theta_{t+1} - \theta'\|_2^2}{2\eta} + \eta\, m L K\,\|f(x_t;\theta_t) - u_t\|_2^2 + \epsilon,
\]
where (a) is because of the definition of gradient descent, (b) is due to the fact $2\langle A, B\rangle = \|A\|_F^2 + \|B\|_F^2 - \|A - B\|_F^2$, and (c) is by $\|\theta_t - \theta_{t+1}\|_2^2 = \eta^2\|\nabla_\theta\mathcal{L}_t(\theta_t)\|_2^2 \le O(\eta^2 KLm)$ together with $\big\|(f(x_t;\theta_t)[k] - u_t[k])\,\nabla_{\theta_t}f(x_t;\theta_t)[k]\big\|_2^2 \le mL\,|f(x_t;\theta_t)[k] - u_t[k]|^2$.
Then, for $T$ rounds, we have
\[
\sum_{t=1}^T \|f(x_t;\theta_t) - u_t\|_2^2/4 - \sum_{t=1}^T \|f(x_t;\theta') - u_t\|_2^2/2
\overset{(a)}{\le} \frac{\|\theta_1 - \theta'\|_2^2}{2\eta} + \eta\, mLK\sum_{t=1}^T \|f(x_t;\theta_t) - u_t\|_2^2 + T\epsilon
\overset{(b)}{\le} \frac{R^2}{2m\eta} + \frac{1}{8}\sum_{t=1}^T \|f(x_t;\theta_t) - u_t\|_2^2 + T\epsilon,
\]
where (a) telescopes and discards the last (negative) term, and (b) sets $\eta = \frac{1}{8mLK}$ and $\epsilon = \frac{KR^2L}{T}$. Rearranging gives
\[
\sum_{t=1}^T \|f(x_t;\theta_t) - u_t\|_2^2/8 \le \sum_{t=1}^T \|f(x_t;\theta') - u_t\|_2^2/2 + 4LKR^2 + KLR^2 \le \sum_{t=1}^T \|f(x_t;\theta') - u_t\|_2^2/2 + 5LKR^2.
\]
Taking the infimum over $\theta' \in \mathcal{B}(\theta_1, R)$ completes the proof.

Lemma F.4. Let $R = S' = \sqrt{\mathbf{u}^\top\mathbf{H}^{-1}\mathbf{u}}$. With probability at least $1 - O(TL^2K)\exp[-\Omega(m\omega^{2/3}L)]$, the result holds that
\[
\inf_{\theta\in\mathcal{B}(\theta_0;R)} \sum_{t=1}^T \mathcal{L}_t(\theta) \le O(LK).
\]
Proof. Let $\bar G = [\nabla_{\theta_0}f(x_1;\theta_0)[1], \nabla_{\theta_0}f(x_1;\theta_0)[2], \dots, \nabla_{\theta_0}f(x_T;\theta_0)[K]]/\sqrt{m} \in \mathbb{R}^{p\times TK}$. Suppose the singular value decomposition of $\bar G$ is $PAQ^\top$, $P \in \mathbb{R}^{p\times TK}$, $A \in \mathbb{R}^{TK\times TK}$, $Q \in \mathbb{R}^{TK\times TK}$; then $A \succeq 0$. Define $\hat\theta = \theta_0 + PA^{-1}Q^\top\mathbf{u}/\sqrt{m}$. Then $\sqrt{m}\,\bar G^\top(\hat\theta - \theta_0) = QAP^\top PA^{-1}Q^\top\mathbf{u} = \mathbf{u}$, which leads to
\[
\sum_{t=1}^T\sum_{k=1}^K \big|u_t[k] - \langle\nabla_{\theta_0}f(x_t;\theta_0)[k], \hat\theta - \theta_0\rangle\big| = 0.
\]
Therefore, the result holds:
\[
\|\hat\theta - \theta_0\|_2^2 = \mathbf{u}^\top QA^{-2}Q^\top\mathbf{u}/m = \mathbf{u}^\top(\bar G^\top\bar G)^{-1}\mathbf{u}/m \le 2\,\mathbf{u}^\top\mathbf{H}^{-1}\mathbf{u}/m. \tag{F.2}
\]
Based on Lemma D.4 and initializing $f(x_t;\theta_0) = 0$ for all $t \in [T]$,
\[
\sum_{t=1}^T \mathcal{L}_t(\hat\theta) \le \sum_{t=1}^T\sum_{k=1}^K \big(u_t[k] - \langle\nabla_{\theta_0}f(x_t;\theta_0)[k], \hat\theta - \theta_0\rangle\big)^2 + TK\cdot O\big((S'/m^{1/2})^{8/3}L^6\, m\log m\big) \le O(LK),
\]
where the last inequality is by the choice of $m$. Because $\hat\theta \in \mathcal{B}(\theta_0, S')$, the proof is completed.

Lemma F.5. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.4. With probability at least $1-\delta$, for all $t \in [Q]$, it holds uniformly that
\[
\sum_{\tau=1}^{Q} \big\|f_2(\phi(x_\tau);\theta^2_\tau) - \big(h(x_\tau) - f_1(x_\tau;\theta^1_\tau)\big)\big\|_2^2 \le O(KS^2) + O\big(\sqrt{Q\log(1/\delta)}\big),
\]
where $u_\tau = [\ell(y_{\tau,1}, y_\tau), \dots, \ell(y_{\tau,K}, y_\tau)]^\top$ and $H_{\tau-1} = \{x_s, y_s\}_{s=1}^{\tau-1}$ is the historical data.
Proof. Applying the Hoeffding–Azuma inequality as in Lemma D.9 to replace $h(x_\tau)$ with the observed $u_\tau$, we have
\[
\sum_{\tau=1}^{Q}\big\|f_2(\phi(x_\tau);\theta^2_\tau) - \big(h(x_\tau) - f_1(x_\tau;\theta^1_\tau)\big)\big\|_2^2
\le \sum_{\tau=1}^{Q}\big\|f_2(\phi(x_\tau);\theta^2_\tau) - \big(u_\tau - f_1(x_\tau;\theta^1_\tau)\big)\big\|_2^2 + O\big(\sqrt{Q\log(1/\delta)}\big)
\overset{(a)}{\le} O(KS'^2) + O\big(\sqrt{Q\log(1/\delta)}\big),
\]
where (a) is an application of Lemma F.3 and Lemma F.4, and applying the Hoeffding–Azuma inequality again to $S'$ completes the proof.

Lemma F.6. Suppose $m, \eta_1, \eta_2$ satisfy the conditions in Theorem 5.4. With probability at least $1-\delta$, the result holds:
\[
\sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\big(h(x_i)[\hat k] - h(x_i)[k^*]\big)
\le \frac{\gamma}{4}\sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\big(f(x_i;\theta_t)[k^*] - h(x_i)[k^*]\big)^2 + \frac{Q}{\gamma} + \sqrt{QKS^2} + O\big(\sqrt{Q\log(1/\delta)}\big).
\]
Proof. For some $\eta > 0$, we have
\[
\sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\big(h(x_i)[\hat k] - h(x_i)[k^*]\big) - \eta\sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\big(f(x_i;\theta_t)[k^*] - h(x_i)[k^*]\big)^2
\]
\[
= \sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\Big[\big(h(x_i)[\hat k] - f(x_i;\theta_t)[\hat k]\big) + \big(f(x_i;\theta_t)[\hat k] - h(x_i)[k^*]\big) - \eta\big(f(x_i;\theta_t)[k^*] - h(x_i)[k^*]\big)^2\Big]
\]
\[
\le \sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\Big[\big(h(x_i)[\hat k] - f(x_i;\theta_t)[\hat k]\big) + \big(f(x_i;\theta_t)[k^*] - h(x_i)[k^*]\big) - \eta\big(f(x_i;\theta_t)[k^*] - h(x_i)[k^*]\big)^2\Big]
\]
\[
\overset{(a)}{\le} \sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\big(h(x_i)[\hat k] - f(x_i;\theta_t)[\hat k]\big) + \frac{Q}{4\eta}
\overset{(b)}{\le} \sqrt{Q\Big(2\sum_{t=1}^{Q}\|f(x_t;\theta_t) - u_t\|_2^2 + O(\log(1/\delta))\Big)} + \frac{Q}{4\eta}
\overset{(c)}{\le} \sqrt{QKS^2} + O\big(\sqrt{Q\log(1/\delta)}\big) + \frac{Q}{4\eta},
\]
where (a) is an application of the AM–GM-type inequality $x - \eta x^2 \le \frac{1}{4\eta}$, (b) is an application of Lemma F.7 together with the Cauchy–Schwarz inequality, and (c) is based on Lemma F.5. Replacing $\eta$ by $\frac{\gamma}{4}$ completes the proof.

Lemma F.7 (Lemma 2, [25]). With probability $1-\delta$, the result holds:
\[
\sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\,\big\|f(x_i;\theta_t) - h(x_i)\big\|_2^2 \le 2\sum_{t=1}^{Q}\|f(x_t;\theta_t) - u_t\|_2^2 + O(\log(1/\delta)).
\]

Finally, we show the proof of Theorem 5.4.
Proof. Assume the event in Lemma F.7 holds. We have
\[
\mathbf{R}_{pool}(Q) = \sum_{t=1}^{Q}\Big[\mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\big[\ell(y_{t,\hat k}, y_t)\big] - \mathbb{E}_{(x_t,y_t)\sim\mathcal{D}}\big[\ell(y_{t,k^*}, y_t)\big]\Big]
= \sum_{t=1}^{Q}\Big[\mathbb{E}_{x_t\sim\mathcal{D}_X}\big[h(x_t)[\hat k]\big] - \mathbb{E}_{x_t\sim\mathcal{D}_X}\big[h(x_t)[k^*]\big]\Big]
\]
\[
\overset{(a)}{\le} \sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\big(h(x_i)[\hat k] - h(x_i)[k^*]\big) + \sqrt{2Q\log(2/\delta)},
\]
where (a) is an application of the Azuma–Hoeffding inequality.
Next, applying Lemma F.6 and letting $\xi = \sqrt{QKS^2} + O\big(\sqrt{Q\log(1/\delta)}\big)$, we have
\[
\mathbf{R}_{pool}(Q)
\overset{(a)}{\le} \frac{\gamma}{4}\sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\big(f(x_i;\theta_t)[k^*] - h(x_i)[k^*]\big)^2 + \frac{Q}{\gamma} + \sqrt{2Q\log(2/\delta)} + \xi
\]
\[
\le \frac{\gamma}{4}\sum_{t=1}^{Q}\sum_{x_i\in P_t} p_i\,\big\|f(x_i;\theta_t) - h(x_i)\big\|_2^2 + \frac{Q}{\gamma} + \sqrt{2Q\log(2/\delta)} + \xi
\]
\[
\overset{(b)}{\le} \frac{\gamma}{2}\sum_{t=1}^{Q}\|f(x_t;\theta_t) - u_t\|_2^2 + \frac{\gamma}{4}\,O(\log(1/\delta)) + \frac{Q}{\gamma} + \sqrt{2Q\log(2/\delta)} + \xi
\]
\[
\overset{(c)}{\le} \frac{\gamma}{2}\,O(KS^2) + \frac{\gamma}{4}\,O(\log(2/\delta)) + \frac{Q}{\gamma} + \sqrt{2Q\log(2/\delta)} + \sqrt{QKS^2} + O\big(\sqrt{Q\log(1/\delta)}\big)
\le O\big(\sqrt{QKS^2}\big) + O\big(\sqrt{2Q\log(3/\delta)}\big),
\]
where (a) follows from Lemma F.6, (b) follows from Lemma F.7, (c) is based on Lemma F.5, and the last inequality applies the union bound and chooses $\gamma = \sqrt{\frac{Q}{KS^2}}$. Therefore:
\[
\mathbf{R}_{pool}(Q) \le O\big(\sqrt{QKS^2}\big) + O\big(\sqrt{2Q\log(3/\delta)}\big).
\]
The proof is complete.
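The choice $\gamma = \sqrt{Q/(KS^2)}$ in the last step balances the two dominant terms; explicitly,
\[
\frac{\gamma}{2}\,KS^2 + \frac{Q}{\gamma}\;\Big|_{\gamma = \sqrt{Q/(KS^2)}}
= \frac{1}{2}\sqrt{QKS^2} + \sqrt{QKS^2}
= O\big(\sqrt{QKS^2}\big),
\]
so the regret is dominated by $O(\sqrt{QKS^2})$ together with the $O(\sqrt{Q\log(1/\delta)})$ martingale terms.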