# Constrained Binary Decision Making

Daniel Průša, Vojtěch Franc
Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague
{prusapa1,xfrancv}@fel.cvut.cz

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Binary statistical decision making involves choosing between two states based on statistical evidence. The optimal decision strategy is typically formulated as a constrained optimization problem, where both the objective and the constraints are expressed as integrals involving two Lebesgue measurable functions, one of which represents the strategy being optimized. In this work, we present a comprehensive formulation of the binary decision making problem and provide a detailed characterization of the optimal solution. Our framework encompasses a wide range of well-known and recently proposed decision making problems as specific cases. We demonstrate how our generic approach can be used to derive the optimal decision strategies for these diverse instances. Our results offer a robust mathematical tool that simplifies the process of solving both existing and novel formulations of binary decision making problems, which lie at the core of many machine learning algorithms.

1 Introduction

Binary statistical decision making (BDM) involves selecting between two possible states using statistical evidence. This approach is applied across various fields, from classical statistics to machine learning. A notable example is the Neyman-Pearson problem [18], which establishes the optimal strategy for testing two hypotheses and serves as the foundation for optimal detectors. Beyond classical statistics, BDM problems are used in many machine learning applications, such as detecting out-of-distribution (OOD) samples [11, 14, 6, 5, 15, 10, 2, 21, 20, 22, 23], designing selective classifiers [13, 12, 4, 8, 9], and developing selective classification in the presence of OOD data (SCOD) [24, 16, 7], to name a few.

Instances of BDM determine the best decision-making strategy by solving a constrained optimization problem involving Lebesgue measurable functions in both the objective and the constraints. Understanding the structure of this optimal strategy is crucial before deploying efficient methods to find it. The Neyman-Pearson lemma, a well-known example found in many statistics textbooks, states that an optimal detection strategy relies on comparing the likelihood ratio of the underlying distributions with a threshold. Therefore, learning a detector from examples involves effectively approximating the likelihood ratio and then tuning the decision threshold. However, deriving the optimal strategy for various BDM problems is generally challenging, often taking several years. For instance, the Bounded-Improvement and Bounded-Abstention models, serving as formal definitions of optimal selective classifiers, were first introduced in [19]. The optimal strategies for these two BDM problems were only described in [8], marking a 14-year period of development. Another recent example is the work of [24], who developed a method for learning a selective classifier handling inputs contaminated by out-of-distribution data, the so-called SCOD problem. Although they formulated an optimal SCOD selection strategy to accept input samples as a solution to a novel BDM problem, they did not analyze the optimal strategy directly.
Instead, they proposed a selector strategy named SIRC, combining two detectors using heuristically chosen non-linear functions. In a recently published paper [16], the authors derived an optimal solution for the SCOD problem, demonstrating that the optimal selective strategy is a simple linear combination of the two detectors. Empirical evidence confirmed that the optimal linear strategy consistently outperforms the heuristic SIRC, even though the two differ only slightly. This example underscores the advantages of understanding the structure of the optimal solution to the underlying BDM problems.

In this paper, we present a comprehensive formulation of the BDM problem and thoroughly characterize the optimal strategy. Our framework encompasses various BDM problems as special cases, enabling us to derive optimal decision strategies for these instances. This provides a robust mathematical tool for solving both existing and new BDM problems. The related theorem is highly general, applying to both discrete and continuous instance spaces without requiring the differentiability of decision and loss functions, unlike common proof techniques based on Lagrange duality. To illustrate the impact of our work, in Section 2 we formulate several practical modifications of existing BDM problems. In Section 4, we demonstrate the derivation of their optimal strategies using our main result, which characterizes the optimal strategies of a generic BDM problem and is presented in Section 3.

2 Examples of existing binary decision making problems

In this section, we present various examples of BDM problems, with a focus on the significant application of designing a selective classifier in the presence of out-of-distribution (SCOD) samples [24, 16]. We note that while selective classification itself is not the contribution of this paper, it serves to illustrate the practical importance and diversity of BDM formulations in addressing a recently emerging problem in the machine learning community.

Consider a given classifier h: X → Y, where X is the instance space and Y a finite set of labels. We assume the classifier h was designed to minimize the prediction loss ℓ: Y × Y → R. Our task is to equip the classifier h with a binary stochastic decision strategy c: X → [0, 1]. The strategy c acts as a selector for input samples. The input sample x ∈ X is accepted for prediction by h with probability c(x) and is rejected otherwise. Thus, the pair (h, c): X → Y ∪ {reject} forms a reject option classifier [3]:

(h, c)(x) = h(x) with probability c(x) , and (h, c)(x) = reject with probability 1 − c(x) .

We assume the following statistical model of the data. Input samples are generated from a mixture of two distributions: the in-distribution (ID) p_I: X → R₊ and the out-of-distribution (OOD) p_O: X → R₊. Thus, the input sample x ∈ X is generated from

p(x) = p_I(x) (1 − π) + p_O(x) π , (1)

where π ∈ [0, 1] represents the portion of OOD samples in the mixture. Furthermore, if the sample x ∈ X is generated from the in-distribution, a latent label y ∈ Y is generated according to the distribution p(y | x). In the outlined statistical model, prediction uncertainty arises from two main sources: i) inherent data uncertainty (aleatoric uncertainty), stemming from the stochastic relationship between the input x and the latent label y, and ii) distribution uncertainty, which arises from input samples potentially coming from either in-distribution or out-of-distribution sources.
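To make the construction concrete, the following minimal sketch wraps a given classifier h and selector c into the stochastic reject option classifier (h, c); the function names and the "reject" token are our illustrative choices, not notation from the paper.

```python
import numpy as np

def reject_option_classifier(h, c, x, rng=None):
    """Evaluate the reject option classifier (h, c) on a single input x.

    h : callable returning a label in Y for an input x
    c : callable returning an acceptance probability c(x) in [0, 1]
    Returns h(x) with probability c(x) and "reject" with probability 1 - c(x).
    """
    rng = rng or np.random.default_rng()
    return h(x) if rng.random() < c(x) else "reject"
```

All strategies discussed below only change how the selector c is constructed; the wrapper itself stays the same.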
We will explore six different BDM instances to develop a selector c that admits an input sample x for classification with h only when the prediction uncertainty is below an acceptable threshold. Each instance defines an optimal selector in a unique way. In Section 2.1, we describe the Neyman-Pearson formulation of the optimal OOD detector [18], which addresses only distribution uncertainty. Sections 2.2 and 2.3 cover the Bounded-Abstention and Bounded-Improvement models [19], addressing only data uncertainty. In Section 2.4, we describe the SCOD problem [24], which defines the optimal selector addressing both distribution and data uncertainty. Additionally, Sections 2.5 and 2.6 give examples of two novel practical variants of the SCOD problem.

The primary contribution of this paper is a framework for solving a broad range of BDM problems. In Section 3, we detail the main theorem that characterizes the optimal solution for a generic BDM problem. In Section 4, we demonstrate how to apply this theorem to find solutions easily for all the BDM instances described here.

2.1 Neyman-Pearson (NP) detector

Assume our goal is to design a selector strategy c: X → [0, 1] that accepts only in-distribution (ID) samples and rejects out-of-distribution (OOD) samples. In other words, c functions as a binary classifier distinguishing ID from OOD samples. We will characterize the strategy c using two metrics. First, the True Positive Rate (TPR), representing the probability that an ID sample x is correctly accepted by the strategy:

tpr(c) = ∫_X p_I(x) c(x) dx . (2)

Second, the False Positive Rate (FPR), representing the probability that an OOD sample x is incorrectly accepted by c:

fpr(c) = ∫_X p_O(x) c(x) dx . (3)

Given a maximal acceptable FPR, fpr_max > 0, the NP task is to find a strategy c that solves the following BDM problem [18]:

max_{c: X → [0,1]} tpr(c) s.t. fpr(c) ≤ fpr_max . (4)

A well-known solution to the NP task (4) is the strategy

c(x) = 1 if g(x) < θ , τ if g(x) = θ , 0 if g(x) > θ , (5)

where g(x) = p_O(x)/p_I(x) is the likelihood ratio, θ ∈ R is a decision threshold, and τ ∈ [0, 1] is the probability of acceptance when g(x) = θ. Knowing the structure of the optimal strategy simplifies its construction significantly. The key to an optimal decision is accurately modeling the likelihood ratio g(x). Once the likelihood ratio is known, finding the optimal strategy involves setting the threshold θ and the randomization parameter τ, typically done by tuning these parameters on a validation sample. Moreover, if p_I(x) and p_O(x) are continuous, the randomization parameter τ can be arbitrary since the event g(x) = θ has zero probability of occurring.
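As a sketch of how the structure (5) is used in practice, the code below tunes θ and τ on held-out OOD scores so that the empirical FPR matches fpr_max; the function names are our own, and we assume the likelihood-ratio scores g(x) have already been estimated.

```python
import numpy as np

def fit_np_strategy(g_ood, fpr_max):
    """Tune theta and tau of the NP strategy (5) on held-out OOD scores.

    g_ood   : 1-D array of likelihood-ratio scores g(x) on OOD samples
    fpr_max : maximal acceptable false positive rate
    Returns (theta, tau) so that the empirical FPR of "accept if g(x) < theta,
    accept with probability tau if g(x) = theta" equals roughly fpr_max.
    """
    g = np.sort(np.asarray(g_ood))
    n = g.size
    k = int(np.floor(fpr_max * n))       # OOD samples we may accept outright
    if k >= n:                           # the constraint is vacuous
        return np.inf, 1.0
    theta = g[k]
    below = np.count_nonzero(g < theta)  # accepted with certainty
    ties = np.count_nonzero(g == theta)  # accepted with probability tau
    tau = (fpr_max * n - below) / ties
    return float(theta), float(np.clip(tau, 0.0, 1.0))

def np_selector(g_x, theta, tau):
    """Acceptance probability c(x) of the Neyman-Pearson strategy (5)."""
    return 1.0 if g_x < theta else (tau if g_x == theta else 0.0)
```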
2.2 Selective Classification: Bounded-Abstention model

Let us consider the ideal scenario where the input instances x ∈ X are solely generated by the in-distribution p_I(x), meaning the out-of-distribution portion in the model (1) is π = 0. Our aim is to devise a selection strategy c: X → [0, 1] that accepts only those input samples x ∈ X where the prediction loss ℓ(y, h(x)) is likely to be small. We will characterize the strategy c using two metrics. First, we employ the notion of coverage,

ϕ(c) = ∫_X p_I(x) c(x) dx , (6)

which reflects the probability of accepting an input sample. Secondly, we introduce the concept of selective risk,

R_S(h, c) = (1 / ϕ(c)) ∫_X Σ_{y∈Y} p_I(x, y) ℓ(y, h(x)) c(x) dx , (7)

representing the expected prediction loss ℓ: Y × Y → R on the accepted samples.

Given a minimal acceptable coverage ω > 0, the Bounded-Abstention model defines the optimal selector strategy as the solution to the following BDM problem [19]:

min_{h ∈ Y^X, c ∈ [0,1]^X} R_S(h, c) s.t. ϕ(c) ≥ ω . (8)

The solution to the Bounded-Abstention model (8) consists of the Bayes predictor h*(x) = argmin_{y'∈Y} Σ_{y∈Y} p_I(y | x) ℓ(y, y') and the strategy [8, 9]

c(x) = 1 if r(x) < θ , τ if r(x) = θ , 0 if r(x) > θ , (9)

where

r(x) = Σ_{y∈Y} p_I(y | x) ℓ(y, h(x)) (10)

is the conditional risk of the classifier h for the input x, θ ∈ R is a decision threshold, and τ ∈ [0, 1] is the probability of acceptance when r(x) = θ. Note that the BDM problem (8) was formulated in [19], but the optimal solution (9) was not derived until 14 years later in [8]. The values of θ and τ depend on p_I(x, y) and ω, and in practice, they are tuned on validation data once a good model of the conditional risk r(x) is established. While this process can be challenging and needs to be addressed for each problem separately, it is still significantly easier than tuning an unknown function. For example, discretizing the possible values of θ and τ and conducting an exhaustive search, although not always optimal, has proven effective in practice.

The additivity of the risk R_S allows h to be optimized independently of c, resulting in the Bayes predictor for all problems discussed in the following sections. Therefore, from this point on, we will focus solely on optimal strategies for c.

2.3 Selective Classification: Bounded-Improvement model

A symmetric definition of the optimal selector strategy is provided by the Bounded-Improvement model. Given a maximal acceptable selective risk λ > 0, the optimal selector is the solution to the following BDM problem:

max_{h,c} ϕ(c) s.t. R_S(h, c) ≤ λ . (11)

The solution to the Bounded-Improvement model (11) is similar to the strategy (9), differing only in the specific values of the decision threshold θ and the randomization parameter τ [9].

2.4 Selective Classification in the presence of OOD data (SCOD)

The SCOD problem integrates the objectives of the Neyman-Pearson task and the Bounded-Abstention model. The aim is to design a selector strategy c: X → [0, 1] that accepts an input sample x if it is likely to be correctly predicted by h and unlikely to be generated from the OOD. Formally, the objective is defined using three metrics: the True Positive Rate tpr(c), equation (2), the False Positive Rate fpr(c), equation (3), and the selective risk R_S(h, c), equation (7). Given the relative cost α > 0 and the minimal acceptable TPR, tpr_min > 0, the goal is to find a strategy c that solves the following BDM problem [24]:

min_{h ∈ Y^X, c ∈ [0,1]^X} (1 − α) R_S(h, c) + α fpr(c) s.t. tpr(c) ≥ tpr_min . (12)

The solution to the SCOD problem (12) is the strategy [16, 7]

c(x) = 1 if s(x) < θ , τ if s(x) = θ , 0 if s(x) > θ , where s(x) = r(x) + (α tpr_min / (1 − α)) g(x) (13)

for α ∈ [0, 1), and s(x) = g(x) for α = 1. Here, r(x) is the conditional risk (10), g(x) = p_O(x)/p_I(x) is the likelihood ratio, θ ∈ R is a decision threshold, and τ ∈ [0, 1] is the acceptance probability for the case s(x) = θ. The BDM problem (12) was formulated in [24]. The optimal solution for continuous distributions was recently derived in [16], and the solution for the general case was provided in [7].
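The characterization (13) translates directly into code. The sketch below assumes estimates of r(x) and g(x) are available and follows the accept-below-threshold reading of (13); all names are illustrative.

```python
def scod_score(r_x, g_x, alpha, tpr_min):
    """Uncertainty score s(x) of the optimal SCOD strategy (13).

    r_x : conditional risk r(x) from (10)
    g_x : likelihood ratio g(x) = p_O(x) / p_I(x)
    """
    if alpha == 1.0:
        return g_x                       # limiting case of (13)
    return r_x + (alpha * tpr_min / (1.0 - alpha)) * g_x

def scod_selector(s_x, theta, tau):
    """Acceptance probability of (13): accept inputs with a low score."""
    return 1.0 if s_x < theta else (tau if s_x == theta else 0.0)
```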
2.5 SCOD: bounded TPR-FPR

To demonstrate the versatility of our framework, we present two novel variants of the SCOD problem in this section and the next. While the optimal solutions to these new formulations were previously unknown, they can be easily derived using the framework proposed in this paper, as we will later show.

The original formulation of the SCOD problem (12) relies on the user-defined relative cost α. However, this assumes that the prediction loss ℓ: Y × Y → R and the cost for incorrectly accepting an OOD sample share the same units, which is not always practical. In such cases, it can be useful to remove the FPR from the objective and impose a hard constraint on it instead. Specifically, we can alternatively formulate the SCOD problem as follows. Given tpr_min > 0 and fpr_max > 0, the goal is to find a strategy c that solves the following BDM problem:

min_{h ∈ Y^X, c ∈ [0,1]^X} R_S(h, c) s.t. tpr(c) ≥ tpr_min and fpr(c) ≤ fpr_max . (14)

In Section 4, we will prove that the optimal solution to the problem (14) is the strategy

c(x) = 1 if s(x) < θ , τ if s(x) = θ , 0 if s(x) > θ , where s(x) = r(x) + β g(x) , (15)

with θ ∈ R as the decision threshold, τ ∈ [0, 1] as the acceptance probability when s(x) = θ, and β ∈ R as a constant depending on the problem setup.

2.6 SCOD: bounded Precision-Recall

Assume the portion of OOD samples π is known or can be estimated. In some applications, it is useful to replace the FPR with precision, the proportion of true ID samples among all accepted inputs. Precision prec(c) is defined as

prec(c) = (1 − π) tpr(c) / (π fpr(c) + (1 − π) tpr(c)) .

If we want to avoid defining the relative cost, as in (14), we can formulate the SCOD problem as follows. Given tpr_min > 0 and prec_min > 0, the goal is to find a strategy c which solves the following BDM problem:

min_{h ∈ Y^X, c ∈ [0,1]^X} R_S(h, c) s.t. tpr(c) ≥ tpr_min and prec(c) ≥ prec_min . (16)

In Section 4, we will prove that the optimal solution to the problem (16) is a strategy similar to (15), differing only in the specific values of the decision threshold θ, the randomization parameter τ, and the multiplier β.

2.7 Summary

In the previous sections, we described six BDM problems, each with a unique optimal selector strategy. The first four are established formulations, while the last two are novel SCOD modifications potentially useful for specific applications. As ML applications grow, new formulations are likely to emerge to address specific setups. Deriving the optimal strategy for a given BDM problem has proved to be the essential first step. Although the statistical model is often unknown in practice, understanding the structure of the optimal strategy is crucial for designing efficient algorithms to learn the selector from examples. This knowledge can narrow the search space or simplify the problem using divide-and-conquer approaches.

For instance, consider constructing a selector strategy for the SCOD problem. Whether we use the original cost-based formulation (12) or the two novel variants (14) and (16), the optimal strategy c takes the form (15) and relies on an uncertainty score s(x) = r(x) + β g(x). The strategy accepts input x if s(x) is below a threshold θ, and with probability τ if s(x) = θ. This analysis shows that the core problem reduces to approximating the two components: the conditional risk r(x) = Σ_{y∈Y} p(y | x) ℓ(y, h(x)) and the OOD/ID likelihood ratio g(x) = p_O(x)/p_I(x). Extensive literature on OOD detection provides methods to approximate g (e.g. [11, 14, 6, 5, 15, 10, 2, 21, 20, 22, 23]), and there are methods to construct good proxies for r (e.g. [13, 12, 4, 8, 9]). Once r and g are known, constructing the selector strategy involves determining the scalars θ, τ, and β. These parameters can be tuned using held-out data, which is much easier than solving the original problem. Moreover, for continuous distributions, the event s(x) = θ has zero probability, making τ arbitrary. In the cost-based formulation, the multiplier β has an analytical solution, leaving only θ to be determined. A possible implementation of this tuning step is sketched below.
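The sketch below illustrates the tuning step: it grid-searches β and θ of the strategy (15) on held-out data so that the empirical TPR and FPR constraints of (14) hold, ignoring the randomization τ (harmless for continuous scores). The function and argument names are our own.

```python
import numpy as np

def tune_strategy_15(r_id, g_id, r_ood, g_ood, tpr_min, fpr_max,
                     betas=np.linspace(0.0, 10.0, 101)):
    """Pick (beta, theta) of the strategy (15) from held-out ID/OOD scores.

    r_id, g_id   : conditional risk and likelihood ratio on held-out ID data
    r_ood, g_ood : the same two scores evaluated on held-out OOD data
    Returns the feasible pair minimizing the empirical selective risk,
    or None if no pair on the grid satisfies both constraints.
    """
    best = None
    for beta in betas:
        s_id, s_ood = r_id + beta * g_id, r_ood + beta * g_ood
        for theta in np.unique(np.concatenate([s_id, s_ood])):
            accepted = s_id < theta
            if not accepted.any():
                continue
            tpr, fpr = accepted.mean(), (s_ood < theta).mean()
            if tpr >= tpr_min and fpr <= fpr_max:
                sel_risk = r_id[accepted].mean()   # empirical selective risk
                if best is None or sel_risk < best[0]:
                    best = (sel_risk, beta, theta)
    return None if best is None else (best[1], best[2])
```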
This example shows that deriving an optimal strategy for BDM problems allows us to use existing methods from OOD detection and selective classification to approximate various SCOD problem formulations. Conversely, skipping the derivation of the optimal strategy and using heuristic rules, such as the SIRC strategy from the original SCOD paper [24], can lead to sub-optimal performance [16].

3 Main result

In this section, we formulate a generic BDM problem, characterize its optimal solutions, and provide a straightforward instance of the optimal strategy. For a measurable set X in a measure space, Lebesgue measurable functions R, p, q: X → [0, ∞) with finite integrals on X, and fixed non-negative real values δ and σ, we define the following optimization problem:

min_{c: X → [0,1]} ∫_X R(x) c(x) dx s.t. ∫_X p(x) c(x) dx ≥ δ and ∫_X q(x) c(x) dx ≤ σ . (17)

Note that we do not pose any additional requirements on the functions, like continuity or differentiability, nor do we differentiate between discrete and continuous sets X; they can be arbitrary.

Theorem 1. For every optimal solution c*: X → [0, 1] to Problem (17), there exist real numbers λ, µ such that

∫_{X_<} p(x) c*(x) dx = ∫_{X_<} p(x) dx , (18)
∫_{X_>} p(x) c*(x) dx = 0 , (19)

where

X_< = { x ∈ X | p(x) > 0 and (R(x) + µ q(x)) / p(x) < λ } , (20)
X_> = { x ∈ X | p(x) > 0 and (R(x) + µ q(x)) / p(x) > λ } . (21)

Informally, equation (18) states that any optimal c* attains the value 1 on X_<, except on a subset of measure zero. Similarly, by equation (19), c* attains the value 0 on X_>, again up to a subset of measure zero. The sets X_< and X_> contain the points x for which R(x)/p(x) + µ q(x)/p(x) is below and above the threshold λ, respectively, excluding any insignificant elements where p(x) = 0. Based on these observations, the next lemma suggests a way to construct a single score function that determines an optimal decision strategy.

Lemma 1. An optimal solution c*: X → [0, 1] to Problem (17) can be found among strategies of the form

c*(x) = 0 if s(x) > λ , τ(x) if s(x) = λ , 1 if s(x) < λ ,

where

s(x) = (R(x) + µ q(x)) / p(x) if p(x) > 0 , and s(x) = +∞ otherwise ,

is a score function, λ, µ are suitable real numbers, and τ: X → [0, 1] is a function implicitly defined by the problem parameters.

Proof. Let c: X → [0, 1] be an optimal solution to Problem (17). Consider the real numbers λ, µ yielded by Theorem 1 and define

c*(x) = 0 if s(x) > λ , c(x) if s(x) = λ , 1 if s(x) < λ .

To show that c* is optimal, decompose X into the pairwise disjoint sets X_< (see (20)), X_> (see (21)), X_∅ = {x ∈ X | p(x) = 0}, and X_= = {x ∈ X | p(x) > 0 and s(x) = λ}. Equalities (18) and (19) ensure that c and c* are nearly identical on the sets X_< and X_>, respectively, up to subsets of measure zero. By the definition of c*, they are identical on X_=. Consequently, the criterion and constraints attain the same values for both c and c* on X_< ∪ X_> ∪ X_=. Finally, the first constraint of (17) is independent of X_∅, and c*(x) ≤ c(x) for all x ∈ X_∅ ensures

∫_{X_∅} R(x) c*(x) dx ≤ ∫_{X_∅} R(x) c(x) dx and ∫_{X_∅} q(x) c*(x) dx ≤ ∫_{X_∅} q(x) c(x) dx .

In conclusion, c* is an optimal solution.

Tool to solve the BDM problems. Lemma 1 is a key tool for simplifying the BDM problem (17). It shows that the main challenge is approximating the ratios R(x)/p(x) and q(x)/p(x). Once these ratios are known, determining the score s involves finding the multiplier µ. With the score s, the optimal strategy c depends on finding the decision threshold λ and the function τ(x). However, in the cases where X is a continuous space, the boundary set {x ∈ X | s(x) = λ} typically has measure zero, except in some pathological instances. Consequently, τ(x) can be chosen arbitrarily and does not play a significant role. Thus, in continuous spaces, one only needs to find the scalars µ and λ, which can be tuned using held-out data, making the process much simpler than solving the original problem from scratch.
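A direct transcription of Lemma 1 for a finite sample of points might look as follows; the +∞ score on {x | p(x) = 0} mirrors the convention of the lemma, while the vectorized form and the constant τ are our illustrative simplifications.

```python
import numpy as np

def lemma1_strategy(R, p, q, mu, lam, tau=0.5):
    """Decision strategy of Lemma 1 evaluated at finitely many points.

    R, p, q : arrays holding the values R(x), p(x), q(x) at the points
    mu, lam : multiplier and decision threshold from Lemma 1
    tau     : acceptance probability on the boundary s(x) = lam
    """
    R, p, q = map(np.asarray, (R, p, q))
    s = np.full(R.shape, np.inf)          # score is +inf where p(x) = 0
    mask = p > 0
    s[mask] = (R[mask] + mu * q[mask]) / p[mask]
    return np.where(s < lam, 1.0, np.where(s == lam, tau, 0.0))
```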
The problem feasibility. In the special case where only one constraint is present, i.e., when either p or q is identically zero, checking feasibility is straightforward. In the general case, to determine the feasibility of Problem (17), we can formulate the task as

min_{c: X → [0,1]} ∫_X q(x) c(x) dx s.t. ∫_X p(x) c(x) dx ≥ δ . (22)

We obtain a similar problem but with only one constraint. Problem (17) is feasible if and only if Problem (22) attains a value less than or equal to σ.

Proof technique. Problem (17) represents an instance of infinite linear programming, grounded in established theory [17]. In particular, when X is a finite set, it reduces to a standard linear program, allowing the use of Lagrange duality to derive optimality conditions such as those characterized in Theorem 1. However, when X is arbitrary and the functions R, p, q are essentially unrestricted, this approach becomes substantially more complex. In such cases, a more general duality theorem would be required, and even then, deducing the desired results would involve handling significantly more intricate optimality conditions. Given these challenges, and drawing inspiration from techniques in related works (e.g. [9, 16, 18] and other studies on the Neyman-Pearson problem), we instead pursue a direct proof, provided in Appendix A. By introducing forbidden point configurations, this proof shows that each optimal solution is determined by a line separating the images of the sets X_< and X_> in the plane, mapped by the function κ(x) = (q(x)/p(x), R(x)/p(x)).

Extensions. It remains an open problem whether our results can be generalized to the extended formulation

min_{c: X → [0,1]} ∫_X R(x) c(x) dx s.t. ∫_X p(x) c(x) dx ≥ δ and ∫_X q_i(x) c(x) dx ≤ σ_i , i ∈ I , (23)

where I is a finite index set, the functions q_i are Lebesgue measurable with finite integrals on X, and σ_i are non-negative bounds. A key question is whether there is a scoring function with 1 + |I| parameters (instead of 2) for this case. It is important to note that generalizing the current proof would involve addressing a significantly greater combinatorial complexity in the point configurations.
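For a finite X, the linear-programming remark above can be made executable: Problem (17) becomes a standard LP in the variables c(x). The following sketch uses scipy (our choice of solver); inspecting the returned solution typically exhibits the threshold structure of Theorem 1, with at most two fractional coordinates.

```python
import numpy as np
from scipy.optimize import linprog

def solve_bdm_lp(R, p, q, delta, sigma):
    """Solve Problem (17) over a finite set X as a linear program.

    The decision variables are the values c(x) in [0, 1], one per x in X.
    Returns the optimal c as an array, or None if (17) is infeasible.
    """
    R, p, q = map(np.asarray, (R, p, q))
    A_ub = np.vstack([-p, q])   # -sum p(x)c(x) <= -delta, sum q(x)c(x) <= sigma
    b_ub = np.array([-delta, sigma])
    res = linprog(R, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0.0, 1.0)] * R.size, method="highs")
    return res.x if res.success else None
```

Feasibility can be checked the same way: solve (22) by minimizing Σ_x q(x) c(x) subject only to Σ_x p(x) c(x) ≥ δ, and compare the optimal value with σ.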
4 General tasks

In this section, we introduce several special cases of problem (17) whose forms of solutions directly follow from Theorem 1, as we will show. These cases encompass all the examples in Section 2. Thus, all the mentioned BDM problems can be solved in their general form using the single framework proposed in this paper. To simplify notation, we represent the Lebesgue measurable functions in (17) as p_i: X → [0, ∞), i ∈ {0, 1, 2}. Furthermore, for Lebesgue measurable c: X → [0, 1], we define the functionals

F_i(c) = ∫_X p_i(x) c(x) dx , i ∈ {0, 1, 2} .

The generic BDM problem (17) then requires solving, for given non-negative scalars δ and σ,

min_{c: X → [0,1]} F_0(c) s.t. F_1(c) ≥ δ and F_2(c) ≤ σ . (24)

Note that F_0 here does not depend on the classifier h, which is consistent with the assumption that h is the Bayes predictor.

Problem 1: Given δ > 0, the task is to solve

min_{c: X → [0,1]} F_0(c) s.t. F_1(c) ≥ δ . (25)

A special case is the Neyman-Pearson task (4). We derive this formulation from (24) by setting p_2(x) = 0 for all x ∈ X. Lemma 1 implies that if the problem is feasible, it has an optimal solution c* such that

c*(x) = 1 if p_0(x)/p_1(x) < λ , τ(x) if p_0(x)/p_1(x) = λ , 0 if p_0(x)/p_1(x) > λ .

Note that here, Lemma 1 does not fully replicate the Neyman-Pearson strategy, in which τ is constant.

Problem 2: Given δ > 0, the task is to solve

min_{c: X → [0,1]} F_0(c) / F_1(c) s.t. F_1(c) ≥ δ . (26)

Special cases are the Bounded-Abstention model (8) and the original SCOD problem (12). This formulation can equivalently be transformed to (25) by showing that there is an optimal solution c* with F_1(c*) = δ. Indeed, if some optimal c: X → [0, 1] fulfills F_1(c) > δ, then we can define c' = (δ / F_1(c)) c, and it holds that F_1(c') = δ and F_0(c') / F_1(c') = F_0(c) / F_1(c). Hence, such a c' is preserved in the set of optimal solutions of the problem

min_{c: X → [0,1]} F_0(c) / δ s.t. F_1(c) ≥ δ ,

which is of the form (25).

Problem 3: Given δ > 0, the task is to solve

max_{c: X → [0,1]} F_1(c) s.t. F_0(c) / F_1(c) ≤ δ . (27)

A special case is the Bounded-Improvement model (11). Let c* be an optimal solution to (27). Define the problem

min_{c: X → [0,1]} F_0(c) / F_1(c) s.t. F_1(c) ≥ F_1(c*) .

Clearly, c* is feasible for the new problem, which is of the form (26). Moreover, any optimal solution c to this problem is also optimal for (27), because it attains the maximal value F_1(c*) and satisfies the constraint F_0(c) / F_1(c) ≤ F_0(c*) / F_1(c*) ≤ δ.

Problem 4: Given δ > 0, σ > 0, the task is to solve

min_{c: X → [0,1]} F_0(c) / F_1(c) s.t. F_1(c) ≥ δ and F_2(c) ≤ σ . (28)

A special case is the SCOD problem formulation with constraints on FPR and TPR (14). In this case, we can transform the problem to (24) in the same way as we did for the formulation (26).

Problem 5: Given δ > 0, σ > 0, the task is to solve

min_{c: X → [0,1]} F_0(c) / F_1(c) s.t. F_1(c) ≥ δ and F_2(c) / F_1(c) ≤ σ . (29)

A special case is the SCOD problem formulation with constraints on TPR and Precision (16), because the second constraint in (16) can be rewritten as

fpr(c) / tpr(c) ≤ (π⁻¹ − 1)(prec_min⁻¹ − 1) ,

which matches the form F_2(c) / F_1(c) ≤ σ. Again, the problem transforms to (24) using the same approach as in the case of the formulation (26).
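The rescaling argument behind Problem 2 is easy to verify numerically on a finite X; the snippet below is only a sanity check of the identities used above, with arbitrary made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
p0, p1, c = rng.random(5), rng.random(5), rng.random(5)
F0 = lambda c_: p0 @ c_          # finite-X stand-ins for the functionals
F1 = lambda c_: p1 @ c_

delta = 0.5 * F1(c)              # ensure F1(c) > delta
c_prime = (delta / F1(c)) * c    # the rescaled strategy
assert np.isclose(F1(c_prime), delta)                        # constraint active
assert np.isclose(F0(c) / F1(c), F0(c_prime) / F1(c_prime))  # ratio unchanged
```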
5 Conclusion

In this paper, we presented a comprehensive framework for solving binary decision making (BDM) problems, which define optimal decision strategies through constrained optimization involving Lebesgue measurable functions. We characterized all optimal strategies for a generic BDM problem and derived a specific class of optimal strategies based on comparing a single score with a decision threshold. Our framework covers a variety of BDM problems, from well-known instances like the Neyman-Pearson problem to recent developments such as the SCOD problem. We demonstrated the versatility and robustness of our framework by deriving optimal solutions for all the BDM problems discussed in our paper. This work provides a foundational approach that can be adapted to future needs in machine learning and statistical decision making, ensuring more effective and theoretically grounded solutions.

6 Acknowledgement

This research was supported by the CTU institutional support (future fund).

References

[1] Constantin Carathéodory. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Mathematische Annalen, 64:95–115, 1907.

[2] Guangyao Chen, Peixi Peng, Xiangqian Wang, and Yonghong Tian. Adversarial reciprocal points learning for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):8065–8081, 2022.

[3] C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970.

[4] C. Corbiere, N. Thome, A. Bar-Hen, M. Cord, and P. Perez. Addressing failure prediction by learning model confidence. In Advances in Neural Information Processing Systems, volume 32, pages 2902–2913, 2019.

[5] Terrance DeVries and Graham W. Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.

[6] Akshay Raj Dhamija, Manuel Günther, and Terrance Boult. Reducing network agnostophobia. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[7] Vojtech Franc, Jakub Paplham, and Daniel Prusa. SCOD: From heuristics to theory. In Proceedings of the European Conference on Computer Vision, 2024.

[8] Vojtech Franc and Daniel Prusa. On discriminative learning of prediction uncertainty. In Proceedings of the International Conference on Machine Learning, PMLR, June 2019.

[9] Vojtech Franc, Daniel Prusa, and Vaclav Voracek. Optimal strategies for reject option classifiers. Journal of Machine Learning Research, 24(11):1–49, 2023.

[10] Federica Granese, Marco Romanelli, Daniele Gorla, Catuscia Palamidessi, and Pablo Piantanida. Doctor: A simple method for detecting misclassification errors. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 5669–5681. Curran Associates, Inc., 2021.

[11] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of the International Conference on Learning Representations, 2017.

[12] H. Jiang, B. Kim, M. Y. Guan, and M. Gupta. To trust or not to trust a classifier. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5546–5557, 2018.

[13] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, pages 6402–6413, 2017.

[14] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.

[15] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[16] Harikrishna Narasimhan, Aditya Krishna Menon, Wittawat Jitkrittum, and Sanjiv Kumar. Plugin estimator of selective classification with out-of-distribution detection. In International Conference on Learning Representations, 2024.

[17] P. Nash and Edward James Anderson. Linear Programming in Infinite-Dimensional Spaces: Theory and Applications. Wiley, 1987.

[18] Jerzy Neyman and Egon Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, pages 175–240, 1928.

[19] Tadeusz Pietraszek. Optimizing abstaining classifiers using ROC analysis.
In Proceedings of the 22nd International Conference on Machine Learning, pages 665–672, 2005.

[20] Yue Song, Nicu Sebe, and Wei Wang. RankFeat: Rank-1 feature removal for out-of-distribution detection. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 17885–17898. Curran Associates, Inc., 2022.

[21] Yiyou Sun, Chuan Guo, and Yixuan Li. ReAct: Out-of-distribution detection with rectified activations. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 144–157. Curran Associates, Inc., 2021.

[22] Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 20827–20840. PMLR, July 2022.

[23] Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. ViM: Out-of-distribution with virtual-logit matching. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4911–4920, 2022.

[24] Guoxuan Xia and Christos-Savvas Bouganis. Augmenting softmax information for selective classification with out-of-distribution data. In Proceedings of the Asian Conference on Computer Vision (ACCV), pages 1995–2012, December 2022.

A Appendix (proof of the main theorem)

For a metric vector space V and its subset A ⊆ V, we write conv(A), int(A), and span(A) to denote the convex hull, interior, and span of A, respectively.

Lemma 2. Let A be a subset of R² with dim span(A) = 2. Given a point x located in the interior of the convex hull of A, there exists a convex polygon P such that x lies in the interior of P. Moreover, all vertices of the polygon P are elements of the set A.

Proof. Given that x ∈ int(conv(A)), there is a triangle T ⊆ conv(A) such that x ∈ int(T). According to Carathéodory's theorem [1], each vertex of T lies in some simplex whose vertices are in the set A. Let these simplices be denoted as S1, S2, and S3. Defining P = conv(S1 ∪ S2 ∪ S3) forms a convex polygon P with vertices in A, and x is in the interior of P.

Proof of Theorem 1. Let X' = {x ∈ X | p(x) > 0}. For all x ∈ X', we define

g(x) = q(x)/p(x) , r(x) = R(x)/p(x) .

We will associate elements of X' with points in a plane by introducing the mapping κ: X' → R²₊ such that κ(x) = (g(x), r(x)). The open ε-ball around κ(x) in R² is denoted as B_ε(x). For a measurable A ⊆ R²₊, define

p(A) = ∫_{κ⁻¹(A)} p(x) dx , (30)

c(A) = (1 / p(A)) ∫_{κ⁻¹(A)} p(x) c(x) dx if p(A) > 0 , and c(A) = 0 otherwise. (31)

Problem (17) depends only on the subset X⁺ = { x ∈ X' | ∀ε > 0: p(B_ε(x)) > 0 } ⊆ X'. To see this, let us define Y1 = {x ∈ X | p(x) = 0} and Y2 = {x ∈ X' | ∃ε > 0: p(B_ε(x)) = 0}. Then we have

∫_{X \ X⁺} p(x) dx = ∫_{Y1} p(x) dx + ∫_{Y2} p(x) dx = 0 , (32)

since, for each open ε-ball B_ε(x) with p(B_ε(x)) = 0, there exists a disc Q(x) ⊆ B_ε(x) that contains κ(x) and has rational radius and center coordinates. Thus, Y2 can be covered by countably many such discs on which the integral of p is zero. As a result, (32) implies that the elements of X \ X⁺ do not affect the left-hand side of the first constraint and do not negatively impact the objective function or the second constraint.
More precisely, for any feasible function c: X → [0, 1], if

c'(x) = c(x) if x ∈ X⁺ , and c'(x) = 0 otherwise , (33)

then it holds that ∫_X R(x) c'(x) dx ≤ ∫_X R(x) c(x) dx, ∫_X p(x) c'(x) dx = ∫_X p(x) c(x) dx, and ∫_X q(x) c'(x) dx ≤ ∫_X q(x) c(x) dx.

For the next considerations, we further distinguish three significant subsets of X⁺ with respect to a fixed optimal solution c*: X → [0, 1]:

X0 = { x ∈ X⁺ | ∃ε > 0: c*(B_ε(x)) = 0 } ,
X1 = { x ∈ X⁺ | ∃ε > 0: c*(B_ε(x)) = 1 } ,
X2 = { x ∈ X⁺ | ∀ε > 0: 0 < c*(B_ε(x)) < 1 } .

It holds that X0 ∪ X1 ∪ X2 = X⁺, since c*(B_ε(x)) = 0 implies c*(B_ε'(x)) = 0 for all 0 < ε' ≤ ε, and c*(B_ε(x)) = 1 implies c*(B_ε'(x)) = 1 for all 0 < ε' ≤ ε.

To confirm the existence of suitable λ and µ, stated by the theorem for c*, it is sufficient to demonstrate that the sets

A0 = { κ(x) | x ∈ X0 } , (34)
A1 = { κ(x) | x ∈ X1 } , (35)
A2 = { κ(x) | x ∈ X2 } (36)

are "almost" linearly separable. This necessitates finding a line L ⊆ R² that includes A2 and linearly separates A0 \ L and A1 \ L. The existence of such an L is ensured if

dim span((conv(A0) ∩ conv(A1)) ∪ A2) < 2 . (37)

The proof continues by listing some general configurations of points from A0, A1, A2 that contradict the optimality of c*. These forbidden configurations will be used to confirm the validity of condition (37). But we first establish estimates on ∫_{κ⁻¹(B_ε(y))} q(x) c*(x) dx and ∫_{κ⁻¹(B_ε(y))} R(x) c*(x) dx for each y ∈ X' and ε > 0. For a given y ∈ X', let κ(y) = (a, b) = (q(y)/p(y), R(y)/p(y)). Since, for all x ∈ κ⁻¹(B_ε(y)), it holds that

|q(x)/p(x) − a| < ε , (38)
|R(x)/p(x) − b| < ε , (39)

we easily obtain

(a − ε) p(B_ε(y)) c*(B_ε(y)) ≤ ∫_{κ⁻¹(B_ε(y))} q(x) c*(x) dx ≤ (a + ε) p(B_ε(y)) c*(B_ε(y)) , (40)
(b − ε) p(B_ε(y)) c*(B_ε(y)) ≤ ∫_{κ⁻¹(B_ε(y))} R(x) c*(x) dx ≤ (b + ε) p(B_ε(y)) c*(B_ε(y)) . (41)

Claim 1.1. Let x1, x2 ∈ X⁺ with q(x1)/p(x1) > q(x2)/p(x2) and R(x1)/p(x1) > R(x2)/p(x2). If c* is optimal, then x1 ∈ X0 or x2 ∈ X1.

Proof of the claim. We proceed by contradiction. Assume that x1 ∈ X1 ∪ X2 and x2 ∈ X0 ∪ X2. This implies the existence of an ε > 0 such that c*(B_ε(x1)) > 0 and c*(B_ε(x2)) < 1. Without loss of generality, choose ε sufficiently small so that

2ε < q(x1)/p(x1) − q(x2)/p(x2) , (42)
2ε < R(x1)/p(x1) − R(x2)/p(x2) , (43)

which further implies B_ε(x1) ∩ B_ε(x2) = ∅. For i ∈ {1, 2}, denote B_i = B_ε(x_i), a_i = q(x_i)/p(x_i), and b_i = R(x_i)/p(x_i). Define a function c': X → [0, 1] as follows:

c'(x) = c*(x) − (ν / c*(B1)) c*(x) if x ∈ κ⁻¹(B1) ,
c'(x) = c*(x) + (ν p(B1) / (p(B2) (1 − c*(B2)))) (1 − c*(x)) if x ∈ κ⁻¹(B2) ,
c'(x) = c*(x) otherwise.

Here, the parameter ν is defined as

ν = min{ c*(B1), p(B2) (1 − c*(B2)) / p(B1) } > 0 ,

which ensures 0 < ν / c*(B1) ≤ 1 as well as 0 < ν p(B1) / (p(B2) (1 − c*(B2))) ≤ 1, hence c' is defined correctly. By comparing c' and c*, we will show that c* is not optimal. Observe that

∫_X p(x) c'(x) dx − ∫_X p(x) c*(x) dx = −(ν / c*(B1)) p(B1) c*(B1) + (ν p(B1) / (p(B2) (1 − c*(B2)))) p(B2) (1 − c*(B2)) = −ν p(B1) + ν p(B1) = 0 .

Moreover,

∫_X q(x) c'(x) dx − ∫_X q(x) c*(x) dx = −(ν / c*(B1)) ∫_{κ⁻¹(B1)} q(x) c*(x) dx + (ν p(B1) / (p(B2) (1 − c*(B2)))) ∫_{κ⁻¹(B2)} q(x) (1 − c*(x)) dx ≤ ν p(B1) (a2 − a1 + 2ε) < 0 ,

where the first inequality uses (40) together with the pointwise bound q(x) < (a2 + ε) p(x) on κ⁻¹(B2) from (38), and the final strict inequality follows from (42). Analogously, we can derive

∫_X R(x) c'(x) dx − ∫_X R(x) c*(x) dx ≤ ν p(B1) (b2 − b1 + 2ε) < 0

from (41) and (43). The function c' is therefore feasible and attains a strictly smaller objective value, which contradicts the optimality of c*.

Claim 1.2. Let x1, x2, x3 be distinct elements of X⁺, and let α1, α2, β be non-negative real numbers such that α1 + α2 = 1, β > 0, and βκ(x3) = α1 κ(x1) + α2 κ(x2). If c* is optimal, the following holds: if β < 1, then either x3 ∈ X0 or {x1, x2} ∩ X1 ≠ ∅; if β > 1, then either x3 ∈ X1 or {x1, x2} ∩ X0 ≠ ∅.

Proof of the claim. We will give a proof for β < 1 and note that the steps for β > 1 are analogous. For i ∈ {1, 2, 3}, denote κ(x_i) = (a_i, b_i). Find an ε > 0 such that

2ε < (1 − β) a3 , (44)
2ε < (1 − β) b3 , (45)

and B_ε(x1), B_ε(x2), B_ε(x3) are pairwise disjoint. By contradiction, assume that c*(B_ε(x1)) < 1, c*(B_ε(x2)) < 1, and c*(B_ε(x3)) > 0.
To simplify the notation, for i ∈ {1, 2, 3}, let B_i = B_ε(x_i), c_i = c*(B_i), and p_i = p(B_i). Note that, according to the definition of X⁺, it holds that p1, p2, p3 > 0. Define a selective function c': X → [0, 1] as follows:

c'(x) = c*(x) + (ν α1 p3 / ((1 − c1) p1)) (1 − c*(x)) if x ∈ κ⁻¹(B1) ,
c'(x) = c*(x) + (ν α2 p3 / ((1 − c2) p2)) (1 − c*(x)) if x ∈ κ⁻¹(B2) ,
c'(x) = c*(x) − (ν / c3) c*(x) if x ∈ κ⁻¹(B3) ,
c'(x) = c*(x) otherwise.

Using the convention z/0 = +∞ for any z > 0, the value of ν is given by

ν = min{ c3, (1 − c1) p1 / (α1 p3), (1 − c2) p2 / (α2 p3) } .

It is easy to see that ν > 0, and that the coefficients ν α1 p3 / ((1 − c1) p1), ν α2 p3 / ((1 − c2) p2), and ν / c3 all lie in [0, 1], hence c' is defined correctly. Now, observe that

∫_X p(x) c'(x) dx − ∫_X p(x) c*(x) dx = ν α1 p3 + ν α2 p3 − ν p3 = ν p3 (α1 + α2 − 1) = 0 .

Next, derive

∫_X q(x) c'(x) dx − ∫_X q(x) c*(x) dx ≤ ν α1 p3 (a1 + ε) + ν α2 p3 (a2 + ε) − ν p3 (a3 − ε) = ν p3 (α1 a1 + α2 a2 − a3 + 2ε) = ν p3 (β a3 − a3 + 2ε) < 0 ,

where the first inequality follows from (40) and the last from (44). Quite similarly,

∫_X R(x) c'(x) dx − ∫_X R(x) c*(x) dx ≤ ν p3 (β b3 − b3 + 2ε) < 0 ,

by (41) and (45). This contradicts the optimality of c*.

Claim 1.3. Let z1, ..., zn ∈ X⁺ and x ∈ X⁺, and let P be a convex polygon with vertices κ(z1), ..., κ(zn), with κ(x) as its inner point. The function c* is not optimal if one of the following cases holds:

1. All z_i's are from X0 and x ∈ X1.
2. All z_i's are from X1 and x ∈ X0.

Proof of the claim. For the first case, we identify the edge of the polygon P that is first intersected by the half-line H starting from the origin (0, 0) and passing through the point κ(x) (see Figure 1a). This edge has vertices κ(y1) and κ(y2) for some y1, y2 ∈ {z1, ..., zn}. The points κ(x), κ(y1), and κ(y2) are non-collinear and satisfy βκ(x) = α1 κ(y1) + α2 κ(y2) for some non-negative real numbers α1, α2, and β, where α1 + α2 = 1 and β < 1. According to Claim 1.2, these points contradict the optimality of c*.

Similarly, for the second case, we identify the edge of P that is the last one to be intersected by the half-line H (see Figure 1b). Let κ(y1) and κ(y2) be the vertices of this edge for some y1, y2 ∈ {z1, ..., zn}. The points κ(x), κ(y1), and κ(y2) satisfy βκ(x) = α1 κ(y1) + α2 κ(y2) for some non-negative real numbers α1, α2, and β, where α1 + α2 = 1 and β > 1. Again, Claim 1.2 contradicts the optimality of c*.

Claim 1.4. If x1, x2 ∈ X0 and y1, y2 ∈ X1, and the line segments with endpoints κ(x1), κ(x2) and κ(y1), κ(y2) intersect at a point lying in the interior of both segments, then the function c* is not optimal.

Figure 1: The infeasible configurations described by Claim 1.3. For simplicity, points are denoted by their pre-images from X⁺.

Figure 2: The infeasible configurations described by Claim 1.4.

Proof of the claim. We consider the two possible situations depicted in Figure 2. First, we examine the scenario where one of the points κ(x1), κ(x2) is located to the lower-left of the other one; specifically, we assume g(x1) ≤ g(x2) and r(x1) ≤ r(x2), as illustrated in Figure 2a. According to Claim 1.1, the points κ(y1), κ(y2) cannot be situated within the shaded rectangular area. Since the line segments intersect, one of the points κ(y1), κ(y2) must be positioned to the upper-left of κ(x2), while the other lies to the lower-right of κ(x2) (these regions are labeled as A and B, respectively, in Figure 2a). However, such an arrangement contradicts Claim 1.2.

Secondly, we consider the configuration depicted in Figure 2b. One of the points κ(y1), κ(y2) must lie within the half-plane bounded by the line passing through κ(x1), κ(x2) and containing the origin (0, 0). Let us assume this point is κ(y2). As per Claim 1.1, κ(y2) cannot reside in the shaded area.
Within the considered half-plane, there are three remaining triangular regions denoted as A, B, and C. Nevertheless, Claim 1.2 prohibits κ(y2) from lying within region B (by applying the claim to the triple x1, x2, and y2), region C (consider the triple y1, y2, and x2), as well as region A (consider the triple y1, y2, and x1).

Claim 1.5. If z1, z2 ∈ X2 and κ(z1), κ(z2) are distinct points, then the slope of the line L passing through κ(z1) and κ(z2) is not positive. Moreover, the open half-plane determined by L and containing the origin (0, 0) does not contain any point from A1 ∪ A2, and similarly, the opposite open half-plane does not contain any point from A0 ∪ A2.

Proof of the claim. If the slope of L is positive, as shown in Figure 3a, then the points κ(z1) and κ(z2) conform to Claim 1.1, implying that c* is not optimal. If the slope of L is not positive, referring to Figure 3b, we deduce that the open half-plane H0 does not contain any point κ(x) from A0 ∪ A2. By Claim 1.1, such a point κ(x) cannot be in the shaded area. Neither can it be in areas A, B, or C, because in each case the points x, z2, and z1, or z1, z2, and x, or z1, x, and z2 would conform to Claim 1.2. Similarly, it can be argued that a point from A1 ∪ A2 cannot lie in the opposite open half-plane.

Figure 3: The infeasible configurations described by Claim 1.5.

Figure 4: The infeasible configuration described by Claim 1.6.

Claim 1.6. Let x1, x2 ∈ X1, y1, y2 ∈ X0, and z ∈ X2. If κ(x1), κ(x2), κ(y1), and κ(y2) are distinct points lying on a line L in such a way that κ(y1) is between κ(x1) and κ(x2), and κ(x2) is between κ(y1) and κ(y2), then κ(z) lies on L, or else the function c* is not optimal.

Proof of the claim. Firstly, observe that the configuration in Figure 4a is not admissible for an optimal c*, because κ(y1) is located in the shaded area containing the points to the left of and below κ(x2) (Claim 1.1 applies here). Hence, it suffices to inspect the case with the opposite slope of L, depicted in Figure 4b. Note that, without loss of generality, we assume that the highest point on L is κ(y2). The case when the highest point is either κ(x1) or κ(x2) is handled similarly. If c* is optimal, then κ(z) cannot lie in the two shaded areas (Claim 1.1). The remaining area is split by L into six different parts, denoted A to F. If κ(z) lies in any of these parts, then we can always select two elements from {x1, x2, y1, y2} that together with z conform to Claim 1.2. These selections are as follows: A: z, y1, x2; B: x1, x2, z; C: z, y1, x1; D: z, x2, y2; E: y1, y2, z; F: z, x2, y1.

We are now ready to confirm condition (37) by listing all the potentially infeasible cases and showing that they contradict the optimality of c*.

1. dim span(conv(A0) ∩ conv(A1)) = 2. We distinguish three subcases. Assume that, for some x ∈ X0, κ(x) is in the interior of conv(A0) ∩ conv(A1). By Lemma 2, there is a convex polygon P with vertices in A1 such that the interior of P contains κ(x). Apply Claim 1.3. Next, assume that x belongs to X1 instead of X0. Analogously, find a convex polygon P with vertices in A0, containing κ(x) in its interior, and apply Claim 1.3. Finally, if int(conv(A0) ∩ conv(A1)) ∩ (A0 ∪ A1) = ∅, take an arbitrary point u ∈ int(conv(A0) ∩ conv(A1)). By Carathéodory's theorem, there exists a simplex S0 (a line segment or a triangle) with vertices in A0 such that u ∈ int(S0). Similarly, there exists a simplex S1 (a line segment or a triangle) with vertices in A1 such that u ∈ int(S1).
It is not difficult to see that the boundary line segments of S0 and S1 intersect, meaning there are two intersecting line segments, one with endpoints in A0 and the other with endpoints in A1. Hence, Claim 1.4 contradicts the optimality of c*.

2. dim span(A2) = 2. There exist x1, x2, x3 ∈ X2 such that κ(x1), κ(x2), κ(x3) are non-collinear. Such a configuration is prohibited by Claim 1.5.

3. dim span(conv(A0) ∩ conv(A1)) = 1, dim span(A2) ≤ 1, and dim span((conv(A0) ∩ conv(A1)) ∪ A2) = 2. There are x1, x2 ∈ X0, x3, x4 ∈ X1, and x5 ∈ X2 conforming to Claim 1.6.

4. dim span(conv(A0) ∩ conv(A1)) = 0, dim span(A2) = 1, and dim span((conv(A0) ∩ conv(A1)) ∪ A2) = 2. There are z1, z2 ∈ X2 such that κ(z1) and κ(z2) conform to Claim 1.5, i.e., the line L passing through κ(z1) and κ(z2) determines two opposite half-planes, H0 and H1. By Claim 1.5, A0 ⊆ H0 ∪ L and A1 ⊆ H1 ∪ L. This means that conv(A0) ∩ conv(A1) ⊆ L, which contradicts the assumption dim span((conv(A0) ∩ conv(A1)) ∪ A2) = 2.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The contribution is summarized in the abstract and explained in the introductory section.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The paper explains that tuning the parameters of the score function that characterizes optimal solutions must be approached individually for each problem. Additionally, it notes that extending the main result to incorporate more constraints remains an open problem.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The full proof of the main theorem is in the appendix; the assumptions are stated in the main paper.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper does not include any experiments.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The paper does not include any experiments.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper does not include any experiments.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper does not include any experiments.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper does not include any experiments.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: Our research conducted in the paper conforms with the NeurIPS Code of Ethics, which we have reviewed.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: We present a mathematical result which does not have any societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper does not utilize any data or models.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use any assets.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing or research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing or research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.