# generalized_prediction_set_with_bandit_feedback__55d17629.pdf

Published in Transactions on Machine Learning Research (05/2025)

Generalized Prediction Set with Bandit Feedback

Zhou Wang zwang198@binghamton.edu Department of Mathematics and Statistics Binghamton University, the State University of New York

Xingye Qiao xqiao@binghamton.edu Department of Mathematics and Statistics Binghamton University, the State University of New York

Reviewed on Open Review: https: // openreview. net/ forum? id= Vlwq Iz41Hp

In high-stakes environments where uncertainties abound, set-valued prediction offers a cautious and robust mechanism by presenting multiple potential labels as the prediction for each test instance to mitigate the potential risk associated with prediction errors. Yet, integrating this paradigm with out-of-distribution (OOD) detection remains scarcely explored in such settings as online learning with bandit feedback. The bandit feedback mechanism informs the learner about the correctness of the pulled arm/action instead of the explicit ground truth label, leaving the true class label unknown when an incorrect action is taken. To address this challenge, we introduce Bandit GPS which conducts set-valued prediction with OOD detection in the bandit feedback setting, using an estimation to the ground truth of class labels. Bandit GPS achieves three objectives: render small/informative prediction sets, enhance the OOD detection performance, and control the recall for all normal classes to meet prescribed requirements. Our approach is characterized by the loss function, which trades off between high OOD detection and small prediction sets. Theoretically, we prove that the convergence rate of the regret is O(T 1/2). The empirical results further show that Bandit GPS effectively controls the recalls with promising performances on OOD detection and informative prediction.

1 Introduction

Conventional single-valued prediction assigns only a single class label to each instance without a provable confidence guarantee. This paradigm could be overly-confident for some data instances, and thereby lead to erroneous decisions with severe consequences. Such an issue is particularly pronounced in critical domains where instances are characterized by high uncertainties. For example, in medical triage, based on a single-label prediction, an AI system might incorrectly categorize a patient with subtle symptoms of a severe condition as low-risk, delaying the critical treatment and hence endangering the patient s life. Similarly, in financial fraud detection, a single-valued prediction system might not adequately account for the nuanced behaviors of complex fraud schemes, resulting in either over-blocking of legitimate transactions (causing inconvenience and potential loss of business) or under-detecting sophisticated fraud tactics (leading to financial losses). To reduce these associated risks from those observations with high uncertainty, set-valued classification, which reports multiple possible class labels, may be employed. This approach allows for human intervention in those difficult data instances, enabling further error reduction through follow-up investigations.

The literature on set-valued prediction is diverse, with several different types of methodologies. (1) Classification with Reject Option (CRO) (Herbei & Wegkamp, 2006; Bartlett & Wegkamp, 2008; Charoenphakdee et al., 2021) incorporates a rejector into the loss function, which is used to abstain from classification if an instance is highly ambiguous. This is equivalent to assigning all labels to the instance, resulting in a less informative prediction set. Zhang et al. (2018) introduces a refined option to produce a more informative/smaller

Published in Transactions on Machine Learning Research (05/2025)

prediction set. However, CRO and its variants, despite considering the uncertainty of instances, do not control the accuracy under the set-valued prediction paradigm, i.e., the probability of the true label being included in the prediction set. (2) In contrast, Conformal Prediction (CP) (Shafer & Vovk, 2008; Vovk et al., 2005) offers a prediction set with a theoretical guarantee of accuracy control. CP, assuming data exchangeability, is a distribution-free framework that works alongside machine learning models, e.g., neural networks, to facilitate set-valued prediction. The caveat with CP is that the prediction set may be less informative due to not explicitly minimizing the set size. Romano et al. (2020); Angelopoulos et al. (2021) have focused on developing score functions to reduce the prediction set size to alleviate this limitation. (3) Confidence Set Learning (CSL) (Wang & Qiao, 2018; 2022a; Sadinle et al., 2019) approaches set-valued prediction from a constrained optimization perspective, explicitly minimizing the prediction set size while controlling accuracy. Conversely, Denis & Hebiri (2017; 2020) seek to maximize the accuracy of the prediction given a size budget.

All the above three camps concentrate on set-valued prediction within a closed-world assumption, where each instance is guaranteed at least one known class label. Nevertheless, the dynamic nature of the open world presents the inevitability of encountering new, unknown classes over time. This reality necessitates the unified process of not only classifying known entities but also detecting novel classes, a task termed out-of-distribution (OOD) detection or open-set recognition (Yang et al., 2021; Lee et al., 2018). Traditional approaches to OOD detection encounter limitations akin to those of single-valued prediction systems. In an effort to bridge this gap, by admitting an empty set for potential atypical points, recent innovations have expanded set-valued classification to incorporate an OOD detection component, such as Cautious Deep Learning (CDL) (Hechtlinger et al., 2018), Balanced Conformal Prediction Set (BCOPS) Guan & Tibshirani (2022), Generalized Prediction Set (GPS) (Wang & Qiao, 2023; 2022b), and Deep Generalized Prediction Set (Deep GPS) (Wang & Qiao, 2024a). CDL and BCOPS, rooted in Conformal Prediction (CP), unfortunately inherit its predisposition towards larger prediction sets. Conversely, GPS and Deep GPS strive for informative prediction by explicitly minimizing prediction set sizes, aligning with the principles of Confidence Set Learning (CSL).

However, these advancements predominantly address scenarios of offline learning with full feedback, where detailed label information for each instance is accessible even following an incorrect prediction. This starkly contrasts with the online bandit feedback setting, where a learner s received feedback is limited to the binary outcome of an arm/action s success, devoid of explicit insights across all labels. For example, within clinical trials, an adaptive trial design system (learner) might select a treatment (arm), e.g., a type of drug, without knowing its efficacy beyond the absence of disease remission, thus lacking comprehensive feedback on optimal treatments. Although Bandit Class-specific Conformal Prediction (BCCP) (Wang & Qiao, 2024b) represents a step towards accommodating set-valued classification in a closed world under bandit feedback, it lacks mechanisms for OOD detection. Given the limited research in this domain, our study pioneers the exploration of generalizing the set-valued prediction to have the capacity of OOD detection under the bandit feedback setting.

In this article, we have made the following four major contributions. Methodologically, we introduce Bandit GPS, an adaptation of the Generalized Prediction Set for online learning with bandit feedback, marking a novel intersection of these research areas. Without accessing the explicit label information, the proposed approach handles the bandit feedback by utilizing an estimation of the ground truth of the class label. Secondly, unlike the offline training regime of Deep GPS (Wang & Qiao, 2024a), Bandit GPS dynamically adjusts to maintain class-specific recall (to be defined later) for normal classes with fewer manual parameter adjustments, leading to a feasible deployment for real-world online learning applications. Thirdly, unlike its predecessors, the Bandit GPS allows explicit control of the balance between informative prediction and OOD detection effectiveness, making it easy for practitioners to use given their specific business needs. Lastly, we theoretically prove a regret convergence rate of O(T 1/2) which is competitive in the literature; the empirical study confirms the efficacy of Bandit GPS.

The rest of this article is organized as follows. In Section 2, we delve into key notations and review works that lay the groundwork for our study. Section 3 articulates the problem formulation and introduces the Bandit GPS method with an algorithm. The convergence rate of regret for the algorithm is theoretically explored in Section 4. In Section 5, we present empirical evidence demonstrating the effectiveness of Bandit GPS. A

Published in Transactions on Machine Learning Research (05/2025)

conclusion is offered in Section 7. Additional materials, including detailed proofs and extended experiments, are provided in Section A.

2 Preliminary

In this section, we discuss some notions and briefly review seminal works across three pivotal areas: set-valued classification, out-of-distribution detection, and the multi-armed Bandit problem.

Set-valued Classification: Let (X, Y ) X Y be an observation generated from an unknown distribution, where Y = {1, . . . , K}. A set-valued classifier is a map ϕ : X 7 2Y to return a prediction set ϕ(X) for the instance X based on a certain rule. Commons metrics (Vovk et al., 2017) to evaluate the performance of a set-valued classifier include the (expected) prediction set size and the accuracy. The prediction set size is defined through the cardinality of the prediction set, i.e., |ϕ(X)|, which gauges the number of class labels predicted for instance X. Because both the classifier ϕ( ) and the instance X are random, we typically focus on E[|ϕ(X)|]. The accuracy in the set-valued prediction paradigm is defined as the probability that the true label of an observation is included in its prediction set, i.e., P[Y ϕ(X)], which is also referred to as coverage probability in the CP literature (Vovk et al., 2005).

The accuracy P[Y ϕ(X)] represents the overall performance of the classifier over all classes. For each class, we denote the (class-specific) recall as P[Y ϕ(X) | Y = k] to measure the classifier s performance on class k. Intuitively speaking, there is a trade-off between prediction set size and recall. Higher recall for normal classes often comes with a larger prediction set ϕ(X) (because this allows an increased likelihood for the true class label being included in ϕ(X)). While CP-based classifiers (Vovk et al., 2005; Romano et al., 2020; Angelopoulos et al., 2021) offer a controlled recall guarantee, their prediction set sizes are contingent on the chosen score function. In contrast, CSL-based approaches (Sadinle et al., 2019; Wang & Qiao, 2018; 2022a; 2023) strive to minimize prediction set sizes while maintaining the normal class recall.

Out-of-distribution (OOD) Detection and Open-set Recognition (OSR): The goal of OOD and OSR (Bendale & Boult, 2016; Lee et al., 2018; Charpentier et al., 2020; Yang et al., 2021) is to conduct single-valued prediction for instances in the normal classes in addition to rejecting atypical observations that potentially comes from a new or anomaly class that was not present in the training data. Recently, Selective Classification with OOD detection (SCOD) (Xia & Bouganis, 2022; Zhu et al., 2023) and Unified Open-set Recognition (UOSR) (Kim et al., 2023; Cen et al., 2023) further consider identifying difficulty observations (hard to distinguish among existing normal classes) when they are not rejected as OOD points. The difficult observations and OOD points in both SCOD and UOSR can be respectively viewed as the ambiguous instances with prediction set sizes greater than 1 and the atypical instances with the empty prediction set in the set-valued prediction paradigm.

Multi-armed Bandit (MAB): The MAB framework (Lai & Robbins, 1985; Auer et al., 2002) concerns a decision-making scenario where a learner sequentially pulls an arm A from a set {1, . . . , K} to maximize cumulative rewards over time. To this end, various policies π (could be either a probability distribution or deterministic rule) are proposed to guide the arm pulling, e.g., epsilon-greedy (Sutton & Barto, 2018), Upper Confidence Bound (Auer et al., 2002), and Thompson sampling strategies (Thompson, 1933), etc. The policy π and the pulled arm A in the MAB can be conceptually treated as a form of decision rule and a predicted label in the classification problem.

MAB problems are prevalent in various fields, such as online advertising, clinical trials, and recommendation systems, where a context X informs personalized decision-making. Specifically, after receiving the context X from an environment, a learner pulls an arm A that follows π( | X), and then receives a feedback 1{A = Y } returned by the environment. Such bandit feedback only indicates the correctness of the pulled arm. If the feedback is 0, the learner does not know the ground truth and hence faces a challenge in optimizing reward strategies. While existing studies (Kakade et al., 2008; Wang et al., 2010; Crammer & Gentile, 2013; Abbasi-Yadkori et al., 2011; Gollapudi et al., 2021; van der Hoeven et al., 2021) have explored linear and neural network-based (Zhou et al., 2020; Jin et al., 2021; Zhang et al., 2021; Xu et al., 2022) approaches to the bandit problem, they seldom address set-valued classification and/or OOD detection within this framework.

Published in Transactions on Machine Learning Research (05/2025)

Notably, while research (Taufiq et al., 2022; Zhang et al., 2023; Stanton et al., 2023) has explored set-valued prediction in reinforcement learning, their focus diverges from our bandit-centric investigation. The recent work BCCP (Wang & Qiao, 2024b) is most aligned with our work, yet it does not directly consider the task of OOD detection, a practical requirement in applications. For instance, in social content moderation, a system needs to flag user-generated content into some categories (safe, offensive, etc), and detect new types of harmful content as well. Our study, therefore, introduces a novel approach, considering set-valued classification with OOD detection in the context of bandit feedback, addressing a significant gap in the literature.

Set-valued Prediction with Bandit Feedback: The learning framework operates under bandit feedback, where only correctness information for a selected action is observed. At each time step t, given a context Xt, a learner first produces a set-valued prediction ϕt(Xt) with a certain theoretical guarantee, indicating a subset of potential labels. Subsequently, a learner takes an action At (denoting a single label) according to the policy πt, where At is sampled from all possibilities. The environment provides a binary feedback signal that whether the pulled arm At confirms the ground true label of Xt (see Algorithm 1).

3 Methodology

In this section, we develop and optimize our set-valued classifier, Bandit GPS, for the online bandit feedback environment. Bandit GPS is engineered to accomplish three critical objectives: (1) generate small/informative prediction sets; (2) proficiently detect OOD queries; and (3) maintain the recall for each normal class to be above a threshold of 1 γ, where γ is a tolerance level prescribed by practitioners. The algorithmic framework and operational flow of Bandit GPS are outlined in Algorithm 1. Throughout the article, we use [K] to denote the set {1, . . . , K} for convenience.

Consider a sequence of i.i.d. samples with inaccessible labels (Xt, Yt) X Y, t = 1, , T generated from the environment. Within this setting, we assume K established normal classes known from historical data. Nonetheless, we anticipate the emergence of OOD instances over time, which may not fit within these predefined classes and potentially signify novel, yet-to-be-identified classes.

To navigate this task, we leverage a neural network as our hypothesis class F. Specifically, we define a decision function vector f Wt F, parameterized by weights Wt at time t, where f Wt(X) := (f 1 Wt(X), , f K Wt(X)) RK encapsulates the decision functions for each of the K normal classes. For any given query Xt revealed at time t, the prediction set is constructed as,

ϕt(Xt) := k [K] : f k Wt(Xt) 0 . (1)

Intuitively speaking, class k is considered to be a plausible label for query Xt if the query falls into the acceptance region for class k defined by {x : f k Wt(x) 0}. Note that the union of all the K many acceptance regions may not cover the entire space X, and regions may overlap with each other.

2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 u

3.0 {u 0} [1 + u] + [1 u] +

Figure 1: Surrogate loss functions

Prediction Set Size Minimization: Based on the above definition of the prediction set, the prediction set size for the query Xt is naturally quantified as |ϕt(Xt)| := PK k=1 1{f k Wt(Xt) 0}, which ranges from 0 to K. In particular, size 0 denotes that the classifier believes Xt is an OOD instance, indicating its distinctiveness from all normal classes. Conversely, size K signifies the maximal ambiguity associated with the query, suggesting that it blurs the lines across all normal classes.

Due to the discontinuous, and hence, computationally intractable 0-1 loss 1{u 0} involved in the definition of prediction set size, we employ a surrogate hinge loss ℓ1(u) := [1+u]+ = max(0, 1+u) as a proxy for the 0-1 loss (see Figure 1). This substitution enables a practical way to return a smaller (more informative) prediction set by minimizing the loss

k=1 ℓ1(f k Wt(Xt)) =

k=1 [1 + f k Wt(Xt)]+. (2)

Published in Transactions on Machine Learning Research (05/2025)

OOD Detection Maximization: A query instance Xt is not flagged as OOD if and only if its prediction set is non-empty, i.e., |ϕt(Xt)| > 0. This is equivalent to the existence of at least one class k [K] such that f k Wt(Xt) 0, or equivalently, maxk [K] f k Wt(Xt) 0. Therefore, to enhance the OOD detection performance, it is imperative to penalize the indicator function of the occurrence of this event, 1{maxk [K] f k Wt(Xt) 0}. Given the similar challenge imposed by the 0-1 loss as mentioned above, to effectively tackle the OOD detection task, we instead minimize the following:

ℓ1 max k [K] f k Wt(Xt) = 1 + max k [K] f k Wt(Xt)

Recall Control for Normal Classes: By relating the definition of prediction set (1), the class-specific recall is further expressed as P [Yt ϕt(Xt) | Yt = k] = P f k Wt(Xt) 0 | Yt = k . Thus, controlling the class-specific normal class recall to the prescribed value at least 1 γ, that is,

P f k Wt(Xt) 0 | Yt = k = E 1{f k Wt(Xt) 0} | Yt = k 1 γ, for k [K]

is equivalent to E 1{Yt = k} 1{f k Wt(Xt) < 0} γ 0, for k [K] due to the Bayes theorem and some algebra (see details in Appendix A.2). This inequality can be viewed as a constraint in constrained optimization problems. Equivalently, our objective function incorporates the left-hand side of this constraint as additional regularization terms, that is,

E 1{Yt = k} 1{f k Wt(Xt) < 0} γ , for k [K]. (4)

In the bandit feedback setting, the ground truth Yt and hence the quantities 1{Yt = k}, k [K] are unobservable since the learner has no direct access to the ground truth of the label Yt once receives the query Xt. Instead, it only knows whether its pulled arm At based on a policy πt is correct or not, i.e., the feedback 1{At = Yt}. Particularly, if the feedback is 0, the learner has no idea which label is the true class. To overcome this challenge, in practice, the ground truth 1{Yt = k} (the first indicator function in (4)) can be replaced by an unbiased estimation conditional on (Xt, Yt),

t,k := 1{At = k}

πt(k | Xt) 1{At = Yt}, (5)

due to the fact Eπt [ t,k | Xt, Yt] = 1{Yt = k}, where πt governs the distribution of At given Xt. Different from the K-arm policy, with arms from [K] reviewed in Section 2, we utilize an augmented policy with an additional arm, namely ood, corresponding to the case that the instance comes from the OOD class. The definition of policy πt is deferred to the next subsection.

To the same token, the intractable 0-1 loss 1{u < 0} in the second indicator function in (4) can be substituted by the surrogate hinge loss [1 u]+ := max(0, 1 u). This amounts to replacing 1{u < 0} γ in (4) by ℓ2,γ(u) := [1 u]+ γ. Thus, instead of working with the loss function in (4), we can minimize the empirical average of the following loss function in practice:

t,k ℓ2,γ(f k Wt(Xt)) := t,k h 1 f k Wt(Xt)

+ γ i . (6)

Policy Design: In this article, our policy πt is constructed based on the scoring function f Wt(x) = (f 1 Wt(x), , f K Wt(x)) associated with the normal classes. Specifically, the policy is defined as,

πt(k | Xt) := exp(f k Wt(Xt))

exp( maxj [K] f j Wt(Xt)) + PK j=1 exp(f j Wt(Xt)) , k [K],

πt(ood | Xt) := exp( maxj [K] f j Wt(Xt))

exp( maxj [K] f j Wt(Xt)) + PK j=1 exp(f j Wt(Xt)) .

Note that πt(a | Xt), a {1, . . . , K, ood}, sum up to 1 and hence is a legitimate probability distribution for At. Under this policy, the probability of pulling arm k is proportional to the exponentiated score for class k,

Published in Transactions on Machine Learning Research (05/2025)

while the probability of pulling ood is proportional to the exponentiated negative maximum score over all normal classes. Hence, a small score across all normal classes indicates a strong likelihood of the query being from the OOD class.

The Overall Objective Function: Finally, taking into consideration three goals, i.e., minimizing prediction set size (2), enhancing OOD detection (3), and controlling class-specific recall (6) for normal classes, the unified risk function for each data instance that we aim to minimize is

L(Xt; Wt) := λ

k=1 ℓ1(f k Wt(Xt)) + (1 λ)ℓ1(max k [K] f k Wt(Xt))+ K P

k=1 λt,k t,kℓ2,γ(f k Wt(Xt)), (7)

where the neural network f Wt is updated by the stochastic gradient descent (or its variants). Here λ [0, 1] is a user-specified value to indicate the trade-off between prediction set size and OOD detection performance. Note that we do not treat λ as a tuning parameter and rather leave it to be specified by the user according to their unique needs.

Algorithm 1 Bandit GPS

Require: Initialize weight matrices W1 and λ1,k, k [K]. Given λ and learning rates η1, η2.

1: for t = 1, 2, 3, . . . , T do 2: Learner receives Xt and returns a prediction set

ϕt(Xt) := k [K] : f k Wt(Xt) 0

3: Learner pulls an arm At πt and computes t,k 4: Update parameters: ( Wt+1 = Wt η1 WL(Xt; Wt)

λt+1,k = λt,k + η2 t,kℓ2,γ(f k Wt(Xt))

As for the parameter λt,k, an improperly chosen λt,k may lead to two cases: a λt,k that is too small might fail to control the recall for normal classes, while a λt,k that is too large can cause the empirical normal recall highly above the prerequisite 1 γ and hence enlarge the prediction sets and impair the OOD detection performance. Additionally, an appropriate value of λt,k can vary across tasks, making it difficult to predefine an optimal value. To address these limitations, we employ the primal-dual algorithm to dynamically optimize the value of λt,k (see Equation (8) in Algorithm 1). Specifically, if the current prediction is over control, i.e., ℓ2,γ(f k Wt(Xt)) < 0, Equation (8) will adaptively lead to a smaller value of the parameter λt+1,k in the next iteration. Such adjustment helps potentially return smaller prediction sets. Similarly, when ℓ2,γ(f k Wt(Xt)) > 0, Equation (8) will lead to a larger value of λt+1,k in the next iteration. This automatic updating rule frees us from the burden of manually selecting the value of λt+1,k with excessive effort.

4 Main Theorems

This section provides the convergence rate analysis concerning the regret associated with prediction set size and OOD detection performance. Our analysis builds upon foundational work in over-parameterized neural networks (Du et al., 2019; Allen-Zhu et al., 2019; Cao & Gu, 2019; 2020; Chen et al., 2023), i.e., sufficiently wide neural networks.

Let the hypothesis class of Re LU neural networks with depth L and a constant width m be

F := f W : Rp 7 RK | f W( ) = WLσL 1 WL 1σL 2 σ1(W1( ) ,

where W := (Vec(W1) , , Vec(WL) ) denotes a concatenated long vector by vectorizing W1 Rm p, Wl Rm m for 2 l L 1, and WL RK m. The Re LU activation functions σl, l [L 1] are element-wisely applied on each layer. Additionally, we define the ball around the neural network s initialization as B(W1, ω) := {W : Wl W1 l 2 ω, l [L]}. In this article, the Frobenius norm of a matrix is denoted by 2 and this notation extends to tensors.

Consider a restricted hypothesis class that adheres to the empirical normal recall control within this ball whose radius scales in the order of m 1/2:

F+ := f W F : 1

t=1 1{Yt = k} ℓ2,γ(f k W(Xt)) 0, k [K] and W B(W1, Rm 1/2) ,

Published in Transactions on Machine Learning Research (05/2025)

0 1000 2000 3000 4000 5000 6000 t

Bird Cat Deer Dog Frog Horse

0 1000 2000 3000 4000 5000 6000 t

aquatic mammals flowers fruit and vegetables natural outdoor scenes omnivores and herbivores medium-sized mammals invertebrates reptiles small mammals trees

0 1000 2000 3000 4000 5000 6000 t

Digit 1 Digit 2 Digit 4 Digit 6 Digit 7 Digit 9

Figure 2: Updated λt,k for normal class across three datasets when λ = 1

where R > 0 is a fixed constant. We define the loss function as ℓ(f W(X)) := λ K PK k=1 ℓ1(f k W(X)) + (1 λ)ℓ1(maxk [K] f k W(X)) which consists of the loss on prediction set size and the loss on OOD detection; they may be viewed as a negative reward and both are parts of the objective function L(X; W) in Equation (7). In this context, regret is defined as the difference between the loss achieved by the algorithm and the least loss that could have been achieved by the best possible action taken in every iteration. To derive the rate of convergence for the regret, we use the following three assumptions.

Assumption 1. We assume the entries in W1 l , l [L 1] are initialized with Gaussian distribution N(0, 2

m) and the entries in W1 L are initialized with Gaussian distribution N(0, 1

Assumption 2. The instance Xt Rp, t [T] is normalized to have a unit norm, i.e., Xt 2 = 1.

Assumption 3. Cλ, C > 0, s.t. for each k [K], t [T], λt,k Cλ, t,k C .

Remark 4. Assumption 1 and Assumption 2 are standard assumptions in the regime of over-parameterized neural networks. Cao & Gu (2020) extends Assumption 2 to the one with bound norms.

The assumption of the bounded λt,k is empirically verified as in Figure 2 for the case of λ = 1. It shows that λt,k 6 holds for all t and k across the three datasets we work with. The assumption of the boundness of t,k is not strong either because its upper bound is inversely related to the lower bound of πt(k | Xt) = 0 (see Equation (5)), which can be manually manipulated in practice.

Theorem 5. Let pk := P[Yt = k] be the prior probability for normal class k [K]. The average recall for any class k over all time for the prediction sets returned by Algorithm 1 is guaranteed:

t=1 P [Yt ϕt(Xt) | Yt = k] 1 γ Cλ Tη2pk k.

The above probability is taken over all the randomness during the learning process. If the learning rate η2 = O(T α) with α [0, 1), Theorem 5 implies that, on average, the accumulative normal recall approaches to the prescribed value 1 γ with rate O(T α 1). Note that the recall bound may be less meaningful in a short-term regime. Intuitively, if pk is very small, it means that we may not collect sufficient samples for class k within a finite time horizon, leading to potential failures in recall control. However, the denominator in Theorem 5 includes T, compensating for this issue in the long run.

Theorem 6. Define an optimal learner with controlled normal recalls from the hypothesis class F+ as f W := argminf W F+ 1

T PT t=1 ℓ(f W(Xt)). There exists an absolute constant κ satisfying that, for any constant R > 0, m O(κ 2R2L12[log(m)]3), and the learning rate η1 = O( R

KT m), with probability at least

1 O(TL2) exp( Ω(mω2/3L)) over the randomness of the initialization W1, the regret is bounded as

Regret(T) := 1

t=1 ℓ(f Wt(Xt)) 1

t=1 ℓ(f W (Xt)) c O LR

log(m) m1/6

where the constant c := ( λ

KCλC )2 + λ

Theorem 6 shows that when the width of the neural network is sufficiently large, Algorithm 1 leads to the regret convergence rate with the order O(T 1/2 + m 1/6p

log(m)). Particularly, if the width further satisfies

Published in Transactions on Machine Learning Research (05/2025)

0 2000 4000 6000 0.00

Acum_Normal_Max

0 2000 4000 6000 0.00

1.00 CIFAR100

Acum_Normal_Max

0 2000 4000 6000 0.00

Acum_Normal_Max

0 2000 4000 6000 0.00

Acum_OOD Acum_Normal_Min

0 2000 4000 6000 0.00

Acum_OOD Acum_Normal_Min

0 2000 4000 6000 0.00

Acum_OOD Acum_Normal_Min

0 2000 4000 6000 t

Prediction set size

Acum_Card Acum_Cond_Card

0 2000 4000 6000 t

Acum_Card Acum_Cond_Card

0 2000 4000 6000 t

Acum_Card Acum_Cond_Card

Method Bandit GPS BCCP Full-feedback

Figure 3: Comparison among methods based on accumulative performance.

m O(T 3), then the regret is close to the parametric rate O(T 1/2), which is similar to the convex online learning. Similar convergence rate results were previously shown in Allen-Zhu et al. (2019); Cao & Gu (2019); Chen et al. (2023) despite different settings.

5 Experiments

Baselines: Due to the unique setting and task, BCCP (Wang & Qiao, 2024b) represents the most aligned approach to our context even though it lacks mechanisms for OOD detection. To establish a baseline for comparison, we extend BCCP to also output an empty set, indicating OOD instances. Additionally, we present the performance of our model under the full feedback scenario, where the model is trained based on the labeled data, in order to show the performance in an ideal setting as a benchmark.

Set-up: We compare the methods by evaluating them on CIFAR10, CIFAR100, and SVHN datasets. For CIFAR10, we set {Bird, Cat, Deer, Dog, Frog, Horse} as normal classes while all the remaining {Airplane, Car, Ship, Truck} as the OOD class; for CIFAR100, we treat 10 coarser classes {Aquatic mammals, Flowers, Fruit and vegetables, Natural outdoor scenes, Omnivores and herbivores, Medium-sized mammals, Invertebrates, Reptiles, Small mammals, Trees} as normal classes, and the remaining 10 coarser classes as the entire OOD class; for SVHN, we let Digits {1, 2, 4, 6, 7, 9} as the normal classes and the remaining Digits {0, 3, 5, 8} be the OOD superclass.

We let all the experiments have the same desired recall 1 γ = 0.95 across datasets, utilizing Res Net (He et al., 2016) as the backbone architecture, Adam for optimization, learning rate η1 = 10 4 for network updates, and η2 = η2(t) = t 1/2 for optimizing λt,k. To improve the computational efficiency, model updates employ batch data with a size of 256 in each iteration, with about 6000 total iterations.

Metrics: For each iteration t [T], we report several accumulative quantities (see definitions in Table 1): Acum_OOD(t), Acum_Normal_Min(t), Acum_Normal_Max(t) show the OOD detection performance and the recall control on normal classes while Acum_Card(t), Acum_Cond_Card(t) assess the prediction set

Published in Transactions on Machine Learning Research (05/2025)

Table 1: Displayed Metrics for Comparison

Accumulative metric Metric on the hold-out dataset

Acum_Normal_Min(t) = mink [K] Acum_recall(t, k) Hdt_Normal_Min(t) = mink [K] Hdt_recall(t, k)

Acum_Normal_Max(t) = maxk [K] Acum_recall(t, k) Hdt_Normal_Max(t) = maxk [K] Hdt_recall(t, k)

Acum_OOD(t) = Acum_recall(t, OOD) Hdt_OOD(t) = Hdt_recall(t, OOD)

Acum_Cond_Card(t) =

Xi Bs |ϕt(Xi)| 1{Yi =OOD} Pt

Xi Bs 1{Yi =OOD} Hdt_Cond_Card(t) = Pn

i=1 |ϕt(Xi)| 1{Yi =OOD} Pn

i=1 1{Yi =OOD}

Acum_Card(t) =

Xi Bs |ϕt(Xi)| Pt

s=1 |Bs| Hdt_Card(t) = 1

n Pn i=1 |ϕt(Xi)|

size. Additionally, we assess classifiers performance on a fixed holdout labeled dataset after each iteration. In Table 1, Bs denotes the batch of data at iteration s, and n denotes the sample size of the holdout dataset, and

Acum_recall(t, k) Acum_recall(t, OOD)

Xi Bs 1{Yi = k & Yi ϕt(Xi)} Pt s=1 P

Xi Bs 1{Yi = k} , k [K] :=

Xi Bs 1{Yi = OOD & |ϕt(Xi)| = 0} Pt s=1 P

Xi Bs 1{Yi = OOD}

Hdt_recall(t, k) Hdt_recall(t, OOD)

:= Pn i=1 1{Yi = k & Yi ϕt(Xi)} Pn i=1 1{Yi = k} , k [K] := Pn i=1 1{Yi = OOD & |ϕt(Xi)| = 0} Pn i=1 1{Yi = OOD}

0 2000 4000 6000 0.00

Hdt_Normal_Max

0 2000 4000 6000 0.00

1.00 CIFAR100

Hdt_Normal_Max

0 2000 4000 6000 0.00

Hdt_Normal_Max

0 2000 4000 6000 0.00

Hdt_OOD Hdt_Normal_Min

0 2000 4000 6000 0.00

Hdt_OOD Hdt_Normal_Min

0 2000 4000 6000 0.00

Hdt_OOD Hdt_Normal_Min

0 2000 4000 6000 t

Prediction set size

Hdt_Card Hdt_Cond_Card

0 2000 4000 6000 t

8 Hdt_Card Hdt_Cond_Card

0 2000 4000 6000 t

Hdt_Card Hdt_Cond_Card

Method Bandit GPS BCCP Full-feedback Figure 4: Comparison among methods based on performance on the fixed holdout datasets.

Published in Transactions on Machine Learning Research (05/2025)

Particularly, the first two rows in Table 1 show the extreme cases of recall control on normal classes in each iteration. The third row displays the OOD detection performance. In terms of prediction informativeness, the fourth row reports the prediction set size exclusively on all normal classes while the last row presents the prediction set size on all observations, including those from the OOD class.

Results: For a fair comparison with the BCCP where OOD detection is not particularly targeting, we first set λ = 1 in the risk function L( , ) of Bandit GPS, implying that the optimization of the OOD section performance is not a priority either. The top panel in Figure 3 demonstrates that both Bandit GPS and BCCP effectively manage the accumulative class-specific normal recall, as illustrated by the colored dashed curves showing the smallest recall over all normal classes, which approach 1 γ as iterations progress. On the other hand, Bandit GPS distinguishes itself from BCCP with superior OOD detection capabilities (solid curves in the top panel) and smaller prediction sets (bottom panel). Further assessment on a holdout labeled dataset (see Figure 4) reinforces Bandit GPS s enhanced performance over BCCP, showcasing its robust and improved OOD detection and informative prediction set. With the full feedback information, metrics in Figures 3 and 4 quickly improve as shown in green curves.

Trade-offs between Prediction Set Size and OOD Recall Driven by λ: The parameter λ affects the trade-off between metrics on prediction set size and OOD recall. In this section, we study these two metrics by varying the values of λ, i.e., 0, 0.25, 0.5, 0.75, and 1. As shown in Figures 5 and 6, the smaller the λ is, the larger the prediction set will be, despite the improvement in the OOD recall.

0 2000 4000 6000 0.00

Acum_Normal_Max

0 2000 4000 6000 0.00

1.00 CIFAR100

Acum_Normal_Max

0 2000 4000 6000 0.00

Acum_Normal_Max

0 2000 4000 6000 0.00

Acum_OOD Acum_Normal_Min

0 2000 4000 6000 0.00

Acum_OOD Acum_Normal_Min

0 2000 4000 6000 0.00

Acum_OOD Acum_Normal_Min

0 2000 4000 6000 t

Prediction set size

Acum_Card Acum_Cond_Card

0 2000 4000 6000 t

Acum_Card Acum_Cond_Card

0 2000 4000 6000 t

Acum_Card Acum_Cond_Card

1 0.75 0.5 0.25 0

Figure 5: Trade-off among accumulative metrics when varying the value of λ.

6 Discussions

This work introduces a novel online set-valued prediction framework with out-of-distribution (OOD) detection under bandit feedback. While we provide theoretical guarantees and empirical validation, we acknowledge several modeling assumptions and algorithmic components that deserve further clarification and suggest directions for future work.

Published in Transactions on Machine Learning Research (05/2025)

0 2000 4000 6000 0.00

Hdt_Normal_Max

0 2000 4000 6000 0.00

1.00 CIFAR100

Hdt_Normal_Max

0 2000 4000 6000 0.00

Hdt_Normal_Max

0 2000 4000 6000 0.00

Hdt_OOD Hdt_Normal_Min

0 2000 4000 6000 0.00

Hdt_OOD Hdt_Normal_Min

0 2000 4000 6000 0.00

Hdt_OOD Hdt_Normal_Min

0 2000 4000 6000 t

Prediction set size

Hdt_Card Hdt_Cond_Card

0 2000 4000 6000 t

Hdt_Card Hdt_Cond_Card

0 2000 4000 6000 t

Hdt_Card Hdt_Cond_Card

1 0.75 0.5 0.25 0

Figure 6: Trade-off among metrics for the fixed holdout dataset when varying the value of λ.

First, the theoretical analysis in Theorem 6 relies on Assumption 3, which requires the boundedness of both the importance-weighted estimator t,k and the Lagrange multiplier λt,k. In practice, t,k depends inversely on the action sampling policy πt(k | Xt). To ensure t,k remains bounded, a simple yet effective solution is to clip the policy from below by a constant p, which guarantees t,k max{1/p, 1/(1 Kp)}. This design consideration can be naturally incorporated into the algorithm to ensure numerical stability and theoretical validity.

On the other hand, the multiplier λt,k is dynamically updated to enforce recall constraints for normal classes. While our method adopts a primal-dual approach, its precise theoretical bound is not given in this work. A promising direction is to build upon frameworks from online constrained optimization (Yu et al., 2017; Yu & Neely, 2020), which offer high-probability or expectation-based bounds for Lagrange multipliers under mild conditions. We plan to incorporate these insights into future extensions of Bandit GPS.

Lastly, our current formulation operates in a binary bandit feedback setting, where a single arm is pulled after set-valued prediction. This separates the prediction step (generating a set) from the feedback step (observing a label). An interesting future direction is to unify these steps within a combinatorial bandit framework, where the predicted set itself is treated as the action, and partial or structured feedback is received over the set. Such a generalization could enhance learning efficiency and better reflect real-world applications where multiple predictions are jointly evaluated.

7 Conclusions

In the high-risk scenarios exacerbated by uncertainties from ambiguous instances and the emergence of new classes in dynamic environments, we introduced Bandit GPS, a novel set-valued classification method tailored for online bandit feedback settings. Bandit GPS navigates the uncertainties by offering a set of plausible labels for ambiguous instances and an empty label set for an OOD observation.

Published in Transactions on Machine Learning Research (05/2025)

Bandit GPS handles the challenge of inaccessible label information inherent in bandit feedback by utilizing an estimation of the ground truth of class labels. Leveraging a primal-dual algorithm, Bandit GPS dynamically adjusts tuning parameters in response to its performance history, facilitating an adaptive and automatic risk management mechanism. Furthermore, by enabling the weight between informative prediction and OOD detection, Bandit GPS empowers practitioners to tailor the system based on their specific operational requirements. The theoretical and empirical validations of Bandit GPS affirm its efficacy in managing the challenges of providing informative predictions while accurately detecting OOD instances under the control of class-specific normal recall.

There are limitations in this initial work that warrant further investigation and improvement. Beyond those discussed in Section 6, one promising direction is to integrate Bandit GPS with alternative multi-armed bandit strategies that offer theoretical performance guarantees, such as Thompson Sampling or the Upper Confidence Bound algorithm, to further enhance its decision-making capabilities. Additionally, rather than treating all abnormal inputs as belonging to a single super OOD class, adapting Bandit GPS to support class-incremental learning could enable more refined discovery and assimilation of newly emerging classes in dynamic environments.

Published in Transactions on Machine Learning Research (05/2025)

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. In International Conference on Machine Learning, pp. 242 252. PMLR, 2019.

Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=e Ndi U_Db M9.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235 256, 2002.

Peter L Bartlett and Marten H Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(Aug):1823 1840, 2008.

Abhijit Bendale and Terrance E Boult. Towards open set deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1563 1572, 2016.

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in neural information processing systems, 32, 2019.

Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3349 3356, 2020.

Jun Cen, Di Luan, Shiwei Zhang, Yixuan Pei, Yingya Zhang, Deli Zhao, Shaojie Shen, and Qifeng Chen. The devil is in the wrongly-classified samples: Towards unified open-set recognition. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id= x Lr0I_x YGAs.

Nontawat Charoenphakdee, Zhenghang Cui, Yivan Zhang, and Masashi Sugiyama. Classification with rejection based on cost-sensitive classification. In International Conference on Machine Learning, pp. 1507 1517. PMLR, 2021.

Bertrand Charpentier, Daniel Zügner, and Stephan Günnemann. Posterior network: Uncertainty estimation without ood samples via density-based pseudo-counts. Advances in Neural Information Processing Systems, 33:1356 1367, 2020.

Xinyi Chen, Edgar Minasyan, Jason D Lee, and Elad Hazan. Regret guarantees for online deep control. In Learning for Dynamics and Control Conference, pp. 1032 1045. PMLR, 2023.

Koby Crammer and Claudio Gentile. Multiclass classification with bandit feedback using adaptive regularization. Machine learning, 90(3):347 383, 2013.

Christophe Denis and Mohamed Hebiri. Confidence sets with expected sizes for multiclass classification. The Journal of Machine Learning Research, 18(1):3571 3598, 2017.

Christophe Denis and Mohamed Hebiri. Consistency of plug-in confidence sets for classification in semisupervised learning. Journal of Nonparametric Statistics, 32(1):42 72, 2020.

Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes overparameterized neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1e K3i09YQ.

Sreenivas Gollapudi, Guru Guruganesh, Kostas Kollias, Pasin Manurangsi, Renato Leme, and Jon Schneider. Contextual recommendations and low-regret cutting-plane algorithms. Advances in Neural Information Processing Systems, 34:22498 22508, 2021.

Published in Transactions on Machine Learning Research (05/2025)

Leying Guan and Robert Tibshirani. Prediction and outlier detection in classification problems. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 84(2):524, 2022.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016.

Yotam Hechtlinger, Barnabás Póczos, and Larry Wasserman. Cautious deep learning. ar Xiv preprint ar Xiv:1805.09460, 2018.

Radu Herbei and Marten H Wegkamp. Classification with reject option. The Canadian Journal of Statistics/La Revue Canadienne de Statistique, pp. 709 721, 2006.

Tianyuan Jin, Pan Xu, Jieming Shi, Xiaokui Xiao, and Quanquan Gu. Mots: Minimax optimal thompson sampling. In International Conference on Machine Learning, pp. 5074 5083. PMLR, 2021.

Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th international conference on Machine learning, pp. 440 447, 2008.

Jihyo Kim, Jiin Koo, and Sangheum Hwang. A unified benchmark for the unknown detection capability of deep neural networks. Expert Systems with Applications, 229:120461, 2023.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4 22, 1985.

Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-ofdistribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018.

Yaniv Romano, Matteo Sesia, and Emmanuel Candes. Classification with valid and adaptive coverage. Advances in Neural Information Processing Systems, 33:3581 3591, 2020.

Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223 234, 2019.

Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.

Samuel Stanton, Wesley Maddox, and Andrew Gordon Wilson. Bayesian optimization with conformal prediction sets. In International Conference on Artificial Intelligence and Statistics, pp. 959 986. PMLR, 2023.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

Muhammad Faaiz Taufiq, Jean-Francois Ton, Rob Cornish, Yee Whye Teh, and Arnaud Doucet. Conformal off-policy prediction in contextual bandits. Advances in Neural Information Processing Systems, 35: 31512 31524, 2022.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933.

Dirk van der Hoeven, Federico Fusco, and Nicolò Cesa-Bianchi. Beyond bandit feedback in online multiclass classification. Advances in neural information processing systems, 34:13280 13291, 2021.

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005.

Vladimir Vovk, Ilia Nouretdinov, Valentina Fedorova, Ivan Petej, and Alex Gammerman. Criteria of efficiency for set-valued classification. Annals of Mathematics and Artificial Intelligence, 81(1):21 46, 2017.

Published in Transactions on Machine Learning Research (05/2025)

Shijun Wang, Rong Jin, and Hamed Valizadegan. A potential-based framework for online multi-class learning with partial feedback. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 900 907. JMLR Workshop and Conference Proceedings, 2010.

Wenbo Wang and Xingye Qiao. Learning confidence sets using support vector machines. Advances in Neural Information Processing Systems, 31, 2018.

Wenbo Wang and Xingye Qiao. Set-valued support vector machine with bounded error rates. Journal of the American Statistical Association, pp. 1 13, 2022a.

Zhou Wang and Xingye Qiao. Learning acceptance regions for many classes with anomaly detection. ar Xiv preprint ar Xiv:2209.09963, 2022b.

Zhou Wang and Xingye Qiao. Set-valued classification with out-of-distribution detection for many classes. Journal of Machine Learning Research, 24(375):1 39, 2023.

Zhou Wang and Xingye Qiao. Deep generalized prediction set classifier and its theoretical guarantees. Transactions on Machine Learning Research, 2024a.

Zhou Wang and Xingye Qiao. Efficient online set-valued classification with bandit feedback. In International Conference on Machine Learning, pp. 51328 51347. PMLR, 2024b.

Guoxuan Xia and Christos-Savvas Bouganis. Augmenting softmax information for selective classification with out-of-distribution data. In Proceedings of the Asian Conference on Computer Vision, pp. 1995 2012, 2022.

Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration. In International Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=xn YACQqua GV.

Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. ar Xiv preprint ar Xiv:2110.11334, 2021.

Hao Yu and Michael J Neely. A low complexity algorithm with O(

T) regret and O(1) constraint violations for online convex optimization with long term constraints. Journal of Machine Learning Research, 21(1): 1 24, 2020.

Hao Yu, Michael Neely, and Xiaohan Wei. Online convex optimization with stochastic constraints. Advances in Neural Information Processing Systems, 30, 2017.

Chong Zhang, Wenbo Wang, and Xingye Qiao. On reject and refine options in multicategory classification. Journal of the American Statistical Association, 113(522):730 745, 2018.

Weitong Zhang, Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural thompson sampling. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=tk Ato Zkc Unm.

Yingying Zhang, Chengchun Shi, and Shikai Luo. Conformal off-policy prediction. In International Conference on Artificial Intelligence and Statistics, pp. 2751 2768. PMLR, 2023.

Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pp. 11492 11502. PMLR, 2020.

Fei Zhu, Zhen Cheng, Xu-Yao Zhang, and Cheng-Lin Liu. Openmix: Exploring outlier samples for misclassification detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12074 12083, 2023.

Published in Transactions on Machine Learning Research (05/2025)

A.1 Experiments Compute Resources

All experiments are conducted on an NVIDIA P100 GPU with CUDA 11.3. The empirical metrics presented in all figures are computed and averaged over 4 runs, with the shaded regions representing the standard error.

A.2 Derive for Normal Recall Control

E 1{f k Wt(Xt) 0} | Yt = k 1 γ

E 1 1{f k Wt(Xt) 0} γ | Yt = k 0

E 1{f k Wt(Xt) < 0} γ | Yt = k 0

E 1{Yt = k} 1{f k Wt(Xt) < 0} γ

P[Yt = k] 0

E 1{Yt = k} 1{f k Wt(Xt) < 0} γ 0

Proof of Theorem 5: From the updating rule Equation (8), we have

λt+1,k = λt,k + η2 t,kℓ2,γ(f k Wt(Xt))

+ λt,k + η2 t,kℓ2,γ(f k Wt(Xt))

t=1 t,kℓ2,γ(f k Wt(Xt)) 1

t=1 λt+1,k λt,k λT +1,k

By taking the expectation for both sides of the last inequality, we have

t=1 E[ t,kℓ2,γ(f k Wt(Xt))] Cλ

t=1 E E[ t,kℓ2,γ(f k Wt(Xt)) | {πs}s<t, {(Xt, Yt)}s t, W1] Cλ

t=1 E 1{Yt = k} ℓ2,γ(f k Wt(Xt)) Cλ

t=1 E ℓ2,γ(f k Wt(Xt)) | Yt = k Cλ

t=1 E 1{f k Wt(Xt) < 0} γ | Yt = k Cλ

t=1 P f k Wt(Xt) < 0] | Yt = k Tγ + Cλ

where (9) holds due to the (conditional) unbiased estimator t,k; (10) holds due to the Bayes theorem, and (11) holds due to the substitution with the surrogate hinge loss as mentioned in (6). Therefore, the last inequality concludes

t=1 P [Yt ϕt(Xt) | Yt = k] 1 γ Cλ Tη2pk .

Published in Transactions on Machine Learning Research (05/2025)

Our derivation for Theorem 6 highly relies on some commonly used lemmas and propositions in the overparameterized neural network literature. We skip their proofs to avoid redundancy. We leave their details to the main literature Allen-Zhu et al. (2019); Cao & Gu (2019; 2020) for the reader s interest.

Additionally, consider two sequences {an} and {bn}. We employ some asymptotic notations as follows: an = O(bn) if lim supn | an

bn | < ; an = O(bn) to indicate an additional logarithmic factor within O( ); and an = Ω(bn) if lim infn | an

For a fixed instance Xt, the neural network can be re-written in the form of matrix products as below:

f W(Xt) = WLDt,L 1WL 1Dt,L 2 Dt,l Wl Dt,l 1 Dt,1W1Xt. (12)

We define ht,l := Dt,l Wl 1DL 2 Dt,1W1Xt, l [L 1], where

Dt,l = Diag 1{(Wlht,l 1)1 0}, , 1{(Wlht,l 1)m 0} .

Thus, the gradient can be written as

Wlf k W(Xt) =

ekh t,L 1 if l = L ht,l 1e k WL

r=l+1 Dt,r Wr

if l [L 1] (13)

where ek RK is the column vector with value 1 in k-th entry and 0 otherwise. When W in Equation (12) is particularly set as the initialization W1, the associated Wl, Dt,l in Equation (12) will be denoted by W1 l , D1 t,l accordingly.

Lemma 7. If K O( m L log(m)), with the probability at least 1 exp( Ω(mω2/3L log(m))) over the randomness of W1 l , l [L], for any perturbation matrices W l , l [L] with W l 2 ω = O(L 1.5), any diagonal matrices D t,l [ 3, 3]m m, l [L 1] with at most O(mω2/3L) non-zero entries, we have

W1 l + W l D1 t,l 1 + D t,l 1

l=a+1 W1 l D1 t,l 1

2 O ω1/3L2p

Proof of Lemma 7:

W1 l + W l D1 t,l 1 + D t,l 1

l=a+1 W1 l D1 t,l 1

l=a W1 LD1 t,L 1 W1 l+1D t,l W1 l + W l D1 t,a + D t,a

l=a+1 W1 LD1 t,L 1 W1 l+1D1 t,l W l D1 t,l 1 + D t,l 1 W1 l 1 + W l 1 D1 t,a + D t,a 2

W1 LD1 t,L 1 W1 l+1D0/1 t,l 2 | {z } Allen Zhuet al. (2019,Lemma 7.4(a))

D t,l 2 D0/1 t,l W1 l + W l D1 t,a + D t,a 2 | {z } Allen Zhuet al. (2019,Lemma 8.6(b))

W1 L 1 W1 l+1D1 t,l 2 | {z } Allen Zhuet al. (2019,Lemma 7.4(b))

W l 2 D1 t,l 1 + D t,l 1 W1 l 1 + W l 1 D1 t,a + D t,a 2 | {z } Allen Zhuet al. (2019,Lemma 8.6(b))

mω2/3L log(m)

L) + L O rm

= O ω1/3L2p

+ O ωL3/2 rm

= O ω1/3L2p

Published in Transactions on Machine Learning Research (05/2025)

where the structure of D0/1 t,l is same as the one in Allen-Zhu et al. (2019). We direct the reader to the corresponding literature for any interest.

Lemma 8. (Cao & Gu, 2019, Lemma D.1) If ω O(L 9/2[log(m)] 3/2), then with probability at least 1 O(TL) exp Ω(mω2/3L) over W1, for all W, f W B(W1, ω), t [T] and l [L 1],

Dt,l e Dt,l 0 O(mω2/3L).

We will use Lemmas 7 and 8 to prove the below Theorem 9. Theorem 9. Let B(W1, ω) := {W : Wl W1 l 2 ω, l [L]} be an ω-neighborhood of the initialization. There exists an absolute constant κ such that, with probability at least 1 O(TL2) exp( Ω(mω2/3L)) over the randomness of W1, for all t [T], and W, W B(W1, ω) with ω κL 6[log(m)] 3/2, it holds uniformly that f W(Xt) f e W(Xt) e Wf e W(Xt), W f W 2 O ω4/3L3p

Proof of Theorem 9: For W, f W B(W1, ω) := {W : Wl W1 l 2 ω, l [L]}, we have

f W(xt) f e W(xt) e Wf e W(xt), W f W

= WLht,L 1 f WL ht,L 1 ht,L 1, WL f WL

r=l+1 e Dt,r f Wr

e Dt,l(Wl 1 f Wl 1) ht,l 1

= WL(ht,L 1 ht,L 1)

r=l+1 e Dt,r f Wr

e Dt,l(Wl 1 f Wl 1) ht,l 1.

Similar to Claim 11.2 in Allen-Zhu et al. (2019), there exists D i,l such that

ht,L 1 ht,L 1 =

r=l+1 (Dt,r + D t,r)Wr

(Dt,l + D t,l)(f Wl 1 Wl 1) ht,l 1,

where diagonal matrices D t,l, l [L 1] has entries [ 1, 1] such that D t,l 0 O(mω2/3L). Therefore, f W(xt) f e W(xt) e Wf e W(xt), W f W 2

r=l+1 Wr(Dt,r 1 + D t,r 1) (Wl 1 f Wl 1) ht,l 1

r=l+1 f Wr e Dt,r 1

(Wl 1 f Wl 1) ht,l 1

r=l+1 Wr(Dt,r 1 + D t,r 1)

r=l+1 f Wr e Dt,r 1

l=1 Wl 1 f Wl 1 2 ht,l 1 2

1 O ω1/3L2p

ωL = O ω4/3L3p

where 1 is due to

r=l+1 Wr(Dt,r 1 + D t,r 1)

r=l+1 f Wr e Dt,r 1

r=l+1 Wr(Dt,r 1 + D t,r 1)

r=l+1 W1 r D1 t,r 1

r=l+1 f Wr e Dt,r 1

r=l+1 W1 r D1 t,r 1

2 O ω1/3L2p

Published in Transactions on Machine Learning Research (05/2025)

where 2 holds by using Lemma 7. This is because f W, W B(W1, ω). By Lemma 8, we have e Dt,r 1 D1 t,r 1 0 O(mω2/3L) and Dt,r 1 D1 t,r 1 0 O(mω2/3L). Hence Dt,r 1 + D t,r 1 D1 t,r 1 0 Dt,r 1 D1 t,r 1 0 + D t,r 1 0 O(mω2/3L) for any r [L 1].

The below Proposition 10 is directly derived based on Theorem 9 by utilizing the convexity and Lipchitzness of a given function ℓ. This proposition implies ℓis near convex with respect to the neural network parameter W, although the neural network is not convex.

Proposition 10. Let the ℓ: RK R be a convex and c-Lipschitz function. There exists an absolute constant κ such that, with probability at least 1 O(TL2) exp( Ω(mω2/3L)) over the randomness of W1, for all t [T], and W, W B(W1, ω) with ω κL 6[log(m)] 3/2, it holds uniformly that

ℓ(f W (Xt)) ℓ(f W(Xt)) Wℓ(f W (Xt)), W W + O cω4/3L3p

Proof of Proposition 10:

ℓ(f W (Xt)) ℓ(f W(Xt))

ℓ (f W (Xt)) (f W(Xt) f W (Xt)) b/c ℓis convex

= ℓ (f W (Xt)) (f W(Xt) f W (Xt) Wf W (Xt), W W )

+ ℓ (f W (Xt)) Wf W (Xt), W W

+ Wℓ(f W (Xt)), W W , (14)

where the last inequality holds due to the Lipschitzness of ℓ( ) and the usage of Theorem 9.

The below Theorem 11 shows the boundness for the gradient of the neural network with respect to the parameters W.

Theorem 11. There exists an absolute constant κ such that, with probability at least 1 O(TL2) exp( Ω(mω2/3L)) over the randomness of W1, for all t [T], l [L], and W B(W1, ω) with ω κL 6[log(m)] 3, it holds uniformly that Wf W(Xt) 2 = O K1/2ω1/3L5/2p

Proof of Theorem 11: This theorem is adapted from Cao & Gu (2019, Lemma B.3). From Equation (13) we have

WLf W(xt) 2 2 =

k=1 WLf k W(xt) 2 2 =

k=1 ekh t,L 1 2 2 = O(K),

Published in Transactions on Machine Learning Research (05/2025)

and for l [L 1],

Wlf W(xt) 2 =

k=1 ht,l 1e k WL

r=l+1 Dt,r Wr

k=1 ht,l 1 2 e k WL

r=l+1 Dt,r Wr

r=l+1 Wr Dt,r 1 2

r=l+1 Wr Dt,r 1 2

r=l+1 Wr Dt,r 1

r=l+1 W1 r D1 t,r 1 2 +

r=l+1 W1 r D1 t,r 1 2

mω2/3L log(m)

= O K1/2ω1/3L2p

m log(m) O(

where (15) holds due to Lemma 7 and Lemma 7.4(a) in Allen-Zhu et al. (2019), and the last equality holds due to the upper bound of ω. Consequently,

Wf W(Xt) 2 = O K1/2ω1/3L5/2p

Proof of Theorem 6: Note that the hinge loss ℓ1 is a convex 1-Lipschitz function. It is clear to see that ℓ(u) is convex with respect to u RK. To show its Lipschitzness, take u, u RK. Thus,

|ℓ(u) ℓ( u)|

k=1 (ℓ1(uk) ℓ1( uk)) + (1 λ)(ℓ1(max k [K] uk) ℓ1(max k [K] uk))

k=1 (ℓ1(uk) ℓ1( uk)) + (1 λ) max k [K](uk uk)

K u u 2 + (1 λ) u u 2

K + 1 λ) u u 2

implies ℓ( ) is a function with Lipschitz constant λ

K + 1 λ, and hence ℓ (u) 2 λ

K + 1 λ. Additionally, assume λt,k Cλ, t,k C for all t [T], k [K], thus L(X; W) is Lipschitz mapping with respect to f W(X) with Lipschitz constant λ

Recall the risk function then can be rewritten as

ℓ(f Wt(Xt)) +

k=1 λt,k t,kℓ2,γ(f k Wt(Xt)),

Published in Transactions on Machine Learning Research (05/2025)

and the updating rule is

Wt+1 = Wt η1

Wℓ(f Wt(Xt)) +

k=1 λt,k t,k Wℓ2,γ(f k Wt(xt)) .

For notation simplicity, substitute ω in the first term of (14) as R m and let

K + 1 λ)R4/3L3p

Therefore, for W B(W1, R m), we have W W1 2 R

t=1 ℓ(f Wt(xt)) 1

t=1 ℓ(f W (xt))

t=1 Wℓ(f Wt(xt)), Wt W + ν b/c of Proposition 10

t=1 Wt Wt+1, Wt W + 1

k=1 λt,k t,k Wℓ2,γ(f k Wt(xt)), W Wt + ν

Wt Wt+1 2 2 + Wt W 2 2 Wt+1 W 2 2

k=1 λt,k t,k Wℓ2,γ(f k Wt(xt)), W Wt + ν

η2 1 2η1 ( λ

KCλC )2 O KLm + ω2L

t=1 CλC |ℓ 2,γ( )|

K Wtf Wt(Xt) 2

KCλC )2 O KLm + ω2L

KLω K1/2ω1/3L5/2p

m log(m) + ν,

KCλC )2 O η1

2 KLm + LR2

+ CλC O KL3R4/3p

log(m) m1/6

K + 1 λ)R4/3L3p

where the last inequality is due to telescoping sum, Proposition 10, and the below fact

Wt Wt+1 2 2 = η2 1

f Wt(Xt)L(Xt, Wt)

2 Wtf Wt(Xt)

KCλC )2 O Kω2/3L5m log(m) , (17)

where (17) holds due to the Lipschitzness of the function L(Xt; Wt) with respect to f Wt(Xt) and the usage of (16).

Published in Transactions on Machine Learning Research (05/2025)

Thus, by setting a constant c := ( λ

KCλC )2 + λ

K + KCλC and choosing η1 = R

KT m, we have

t=1 ℓ(f Wt(Xt)) 1

t=1 ℓ(f W (Xt)) c O LR

T + L3R4/3p

log(m) m1/6