# OCCAM: Towards Cost-Efficient and Accuracy-Aware Classification Inference

Published as a conference paper at ICLR 2025

Dujian Ding, Bicheng Xu, Laks V. S. Lakshmanan
University of British Columbia
dujian.ding@gmail.com, {bichengx, laks}@cs.ubc.ca

ABSTRACT

Classification tasks play a fundamental role in various applications, spanning domains such as healthcare, natural language processing and computer vision. With the growing popularity and capacity of machine learning models, people can easily access trained classifiers as a service online or offline. However, model use comes with a cost, and classifiers of higher capacity (such as large foundation models) usually incur higher inference costs. To harness the respective strengths of different classifiers, we propose a principled approach, OCCAM, to compute the best classifier assignment strategy over classification queries (termed the optimal model portfolio) so that the aggregated accuracy is maximized, under user-specified cost budgets. Our approach uses an unbiased and low-variance accuracy estimator and effectively computes the optimal solution by solving an integer linear programming problem. On a variety of real-world datasets, OCCAM achieves 40% cost reduction with little to no accuracy drop.

1 INTRODUCTION

Classification is a fundamental task with numerous real-world applications, from medical diagnosis (Abhisheka et al., 2024) to natural language processing (Kowsari et al., 2019) and object recognition (Zou et al., 2023). In recent years, a wealth of pre-trained classifiers has become accessible, both online and offline, at varying costs. For instance, Google's Facial Detection service (Google) charges $1.50 per 1,000 images, providing developers with a reliable solution for face recognition. Simultaneously, a variety of open-source models, such as ResNet (He et al., 2016), Vision Transformer (Dosovitskiy et al., 2020), and Swin Transformer (Liu et al., 2021), have been made available for developers to integrate into their services.

On the one hand, empirical evaluations conducted in (Su et al., 2018) and our independent assessment (see Figure 1a) consistently indicate that smaller/cheaper models tend to exhibit a gap in classification accuracy compared to their larger counterparts. On the other hand, though larger neural models are equipped with higher capacity, they often come with higher costs, e.g., hardware usage and latency, for both training and inference. This can potentially impose an enormous cost on both end users of classification services and the service providers (e.g., Google¹, Amazon², and Microsoft³). Confronted with the general trade-off between classification accuracy and inference cost, we advocate a hybrid inference framework which seeks to combine the advantages of both small and large models. Specifically, we study the problem: how can we optimally select and assign models to different classification queries in order to achieve the highest accuracy while adhering to a limited cost budget? We formally refer to it as the optimal model portfolio problem (details in Section 3). Our approach is motivated by the observation that while small classifiers typically have reduced accuracy over the population, they can still agree with large classifiers on certain queries a large proportion of the time, which suggests the existence of a subset of easy queries on which even small classifiers can make the right prediction.
This is also illustrated in Figure 1b, where we plot the frequency with which different classifiers successfully make the right prediction on the same classification queries.

¹ https://cloud.google.com/prediction
² https://aws.amazon.com/machine-learning
³ https://studio.azureml.net

Figure 1: We investigate the Tiny ImageNet dataset consisting of 200 classes (see Section 5 for details). We observe that (a) smaller classifiers (e.g., ResNet-18) generally yield lower accuracy, (b) small classifiers can agree with large classifiers at a high frequency (each entry indicates the percentage of queries on which the classifier on the row makes the right prediction and so does the classifier on the column), and (c) our approach OCCAM achieves 20% cost reduction with less than 1% accuracy drop.

Taking image classification as an example, a small classifier, ResNet-18 (He et al., 2016), can correctly classify 75% of the images on which the large classifier, SwinV2-B (Liu et al., 2022), makes the right prediction, suggesting that we can replace SwinV2-B with ResNet-18 on these image queries, saving significant inference costs without any accuracy drop (details in Section 5). With this insight, we propose a principled approach, Optimization with Cost Constraints for Accuracy Maximization (OCCAM), to effectively identify easy queries and assign classifiers to different user queries to maximize the overall classification accuracy subject to the given cost budgets.

Previous work (Chen et al., 2022) trains ML models to select classifiers, which requires sophisticated configuration and lacks performance guarantees that are critical in real-world scenarios. In this paper, by leveraging weak assumptions, such as the well-separation assumption (Yang et al., 2020), we demonstrate that it is possible to compute optimal model assignments with statistical guarantees. For example, the image classification task (Singh & Singh, 2020), a critical and well-studied application of classification, has been observed to empirically satisfy the well-separation assumption (Yang et al., 2020), which is also corroborated by our own independent evaluation (see Appendix A.1). The intuition is that for well-separated classification problems such as image classification, we can learn robust classifiers that have similar performance on similar queries. By leveraging the specific structure of well-separated classification tasks, we develop an unbiased and low-variance estimator for classifier test accuracy with asymptotic guarantees. For each classification query, we compute its nearest neighbours in pre-computed samples to estimate the test accuracy for each classifier. To the best of our knowledge, we are the first to open up the black box by developing a white-box accuracy estimator for ML classifiers with statistical guarantees. Next, armed with our classifier accuracy estimator, we compute the optimal classifier assignment strategy over all classification queries (the optimal model portfolio) subject to a given cost budget by solving an integer linear programming (ILP) problem (see Section 4).
As a preview, Figure 1c shows that OCCAM can achieve 20% cost reduction with less than 1% accuracy drop. We show even higher cost reduction with little to no accuracy drop on various real-world datasets in Section 5. Figure 2 depicts the overall pipeline of OCCAM: (1) draw samples and pre-compute the ML classifier accuracy for each sample, (2) for each test query, use its nearest neighbour from all samples to estimate the test accuracy of each classifier, (3) compute the optimal model portfolio w.r.t. cost budgets by solving the ILP problem.

Our main technical contributions are: (1) we formally define the optimal model portfolio problem to reduce overall inference costs while maintaining high performance subject to user-specified cost budgets (Section 3); (2) we propose a novel and principled approach, OCCAM, to effectively compute the optimal model portfolio with statistical guarantees under weak assumptions (Section 4); and (3) as an illustration of the usefulness of OCCAM, we conduct extensive experiments on the image classification task given its well-separation property and demonstrate the effectiveness of OCCAM on a variety of real-world datasets (Section 5).

Figure 2: OCCAM: Optimization with Cost Constraints towards Accuracy Maximization. The pipeline (1) pre-computes samples, (2) estimates accuracy using nearest neighbours, and (3) computes the optimal model portfolio under cost budgets.

2 RELATED WORK

Efficient Machine Learning (ML) Inference. Efficient ML inference is crucial for real-time decision-making in various applications such as autonomous vehicles (Tang et al., 2021), healthcare (Miotto et al., 2018), and fraud detection (Alghofaili et al., 2020). It involves applying a pre-trained ML model to make predictions, where the inference cost is expected to dominate the overall cost incurred by the model. Model compression, which replaces a large model with a smaller model of comparable accuracy, is the most common approach employed for enhancing ML inference efficiency. Common techniques for model compression include model pruning (Hassibi et al., 1993; LeCun et al., 1989; Ding et al.), quantization (Jacob et al., 2018; Vanhoucke et al., 2011), knowledge distillation (Hinton et al., 2015; Urban et al., 2016), and neural architecture search (Elsken et al., 2019; Zoph & Le, 2016). These static efficiency optimizations typically lead to a fixed model with lower inference cost but also reduced accuracy compared to its larger counterpart, which may not suffice in highly sensitive applications like collision detection (Wang et al., 2021) and prognosis prediction (Zhu et al., 2020). This shortcoming is already evident in the inference platforms discussed in Section 1, highlighting the need for more dynamic optimizations to effectively address the diverse demands of users.

Hybrid ML Inference. Recent works (Kag et al., 2022; Ding et al., 2022; 2024) have introduced a novel inference paradigm termed hybrid inference, which invokes models of different sizes on different queries, as opposed to employing a single model on all inference queries. The smaller model generally incurs a lower inference cost but also exhibits reduced accuracy compared to the larger model.
The key idea is to identify easy inference queries on which the small models are likely to make correct predictions and invoke small models on them when cost budgets are limited, thereby reducing overall inference costs while preserving solution accuracy. By adjusting the cost budgets, users can dynamically trade off between accuracy and cost within the same inference setup. Kag et al. (2022) and Ding et al. (2022; 2024) consider a simple setting of only one large and one small model and do not allow for explicit cost budget specification, which could be necessary in production scenarios. Chen et al. (2020) study a setup with multiple ML models and learn an adaptive strategy to generate predictions by calling a base model and sometimes an add-on model when the base model quality scores are lower than the learned thresholds. However, both base and add-on models are selected in a probabilistic manner and this approach fails to satisfy the user-specified cost budgets deterministically. Chen et al. (2022) study a similar setup with multiple ML models and allocate cost budgets according to model-based accuracy prediction. This approach requires a separate training phase for the accuracy predictor, which needs a large amount of training data, and provides no guarantee on the prediction quality. Mixture-of-experts (MoE) (Shazeer et al., 2017) also involves dynamically deciding which model (or "expert") processes a given query based on specific criteria. However, MoE often assumes homogeneous experts (Fedus et al., 2022) with routing decisions made by learned gating functions that require significant computational resources to re-train if we want to add/update/delete experts. Unlike previous works, we propose an unbiased and low-variance accuracy estimator with asymptotic guarantees, based on which we present a novel training-free approach, OCCAM, to effectively compute the optimal assignment of diverse classifiers to given queries, under the cost budgets specified by users. It is worth pointing out that OCCAM is an instance of the algorithm selection problem (Rice, 1976) with a specific focus on classification tasks and pre-trained classifiers.

Image Classification. Image classification is a fundamental task in computer vision, where given an image, a label needs to be predicted. It serves as an essential building block for many high-level AI tasks, e.g., image captioning (Vinyals et al., 2015) and visual question answering (Antol et al., 2015), where objects need to be first recognized. With the growing capacity of deep learning models, from convolutional neural networks (CNNs) (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014) to Transformer architectures (Dosovitskiy et al., 2020; Liu et al., 2021), the classification accuracy on standard image classification benchmarks (Russakovsky et al., 2015) has been greatly improved. In this work, we use the image classification task to illustrate and evaluate our proposed approach OCCAM given its well-separation property (Yang et al., 2020). We utilize both CNN (e.g., ResNet models) and Transformer (e.g., Swin Transformers) image classifiers in our evaluation.

3 PROBLEM DEFINITION

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be an instance space (e.g., images) equipped with a metric $dist: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$, and let $[C] = \{1, 2, \ldots, C\}$ be the set of possible labels with $C \geq 2$. Let $\mathcal{X}$ contain $C$ disjoint classes $\mathcal{X}^{(1)}, \mathcal{X}^{(2)}, \ldots, \mathcal{X}^{(C)}$, where for each $i \in [C]$, all $x \in \mathcal{X}^{(i)}$ have label $i$.
Let $f_1, f_2, \ldots, f_M$ be a set of classifiers, with $b_i$ being the cost of a single inference call of $f_i$. Given a query $x \in \mathcal{X}$, each classifier $f_i$ outputs a single label from $[C]$ at the cost $b_i$. We define a model portfolio as follows.

Definition 3.1 (Model Portfolio). Given queries $X \subseteq \mathcal{X}$ to be classified and classifiers $f_1, f_2, \ldots, f_M$, a model portfolio $\mu$ is a mapping $\mu: X \to [M]$ such that each $x \in X$ is classified by the classifier $f_{\mu(x)}$.

We assume an oracle classifier $O: \mathcal{X} \to [C]$ which outputs the ground truth label $O(x)$ for all queries $x \in \mathcal{X}$. Given a finite set of queries $X \subseteq \mathcal{X}$, the accuracy of a model portfolio $\mu$ on $X$ is the frequency of ground truth labels correctly predicted by $\mu$, i.e.,

$$\text{Accuracy}_X(\mu) = \frac{\sum_{x \in X} \mathbb{1}\{f_{\mu(x)}(x) = O(x)\}}{|X|},$$

where $\mathbb{1}\{\text{condition}\}$ is an indicator function that outputs 1 iff the condition is satisfied. Similarly, the cost of model portfolio $\mu$ on $X$ is the sum of all inference costs incurred by executing $\mu$ on $X$, i.e., $\text{Cost}_X(\mu) = \sum_{x \in X} b_{\mu(x)}$. We will use the notation $\text{Accuracy}(\mu)$, $\text{Cost}(\mu)$ when $X$ is clear from the context. We define our problem as follows.

Definition 3.2 (Optimal Model Portfolio). Given queries $X \subseteq \mathcal{X}$, a cost budget $B \in \mathbb{R}^+$, and classifiers $f_1, f_2, \ldots, f_M$, find the optimal model portfolio $\mu^*$ such that $\text{Cost}(\mu^*) \leq B$ and $\text{Accuracy}(\mu^*) \geq \text{Accuracy}(\mu)$, for all model portfolios $\mu$ with $\text{Cost}(\mu) \leq B$.

4 METHODOLOGY

We describe the general framework to solve the optimal model portfolio problem in the next sections. Our overall strategy consists of two steps. First, we propose an unbiased, low-variance estimator for the accuracy of any given model portfolio $\mu$, with asymptotic guarantees. Next, we describe how to determine the optimal model portfolio $\mu^*$ by formulating it as an integer linear programming (ILP) problem, subject to user-specified budget constraints. All proofs can be found in Appendix B.

4.1 ESTIMATING Accuracy($\mu$)

Previous work on hybrid ML (Kag et al., 2022; Chen et al., 2022; Ding et al., 2024) typically relies on training a neural router to predict the accuracy of a given set of classifiers for given user queries, based on which queries are routed to different classifiers. Such a paradigm not only involves a non-trivial training configuration but also lacks estimation guarantees, which can be critical in scientific and production settings. We propose a principled approach to estimate the test accuracy of a given model portfolio for given user queries. By leveraging the specific structure of well-separated classification problems like image classification, we propose an unbiased, low-variance estimator for the test accuracy with asymptotic guarantees.

Without loss of generality, we consider the wide class of soft classifiers in this study. Given a query $x \in \mathcal{X}$, a soft classifier first outputs a distribution over all labels $[C]$, based on which it then makes a prediction at random. Given a soft classifier $f_i$, we abuse the notation and let $f_i(x)[j]$ denote the likelihood that $f_i$ predicts label $j \in [C]$, that is, $f_i(x)[j] := \Pr[f_i(x) = j]$, for query $x \in \mathcal{X}$. Deterministic classifiers (e.g., the oracle $O$) can be seen as a special case of soft classifiers with a one-hot distribution over all labels. In practice, soft classifiers can be constructed from softmax classifiers (e.g., ResNet) by simply sampling w.r.t. the probability distribution output by the softmax layer. Clearly, given a model portfolio $\mu$, $\text{Accuracy}(\mu)$ is a random variable due to the random nature of the soft classifiers.
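To make the soft-classifier abstraction concrete, the following is a minimal sketch (not the paper's released code; the model handle and variable names are illustrative) of how a softmax classifier can be turned into a soft classifier, and how the probability mass it places on a known ground-truth label, used as the success probability in the next section, would be computed on pre-labelled validation samples:

```python
import torch

def soft_predict(model, x):
    """Turn a softmax classifier into a soft classifier: return the predicted
    label distribution over [C] and a label sampled from that distribution."""
    with torch.no_grad():
        probs = torch.softmax(model(x.unsqueeze(0)), dim=-1).squeeze(0)  # shape (C,)
    label = torch.multinomial(probs, num_samples=1).item()  # sample w.r.t. the distribution
    return probs, label

def success_probability(model, x, true_label):
    """Probability mass the soft classifier places on the ground-truth label of x,
    i.e., Pr[f_i(x) = O(x)]; computable only where the ground truth is known
    (e.g., on pre-computed validation samples)."""
    probs, _ = soft_predict(model, x)
    return probs[true_label].item()
```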
The expected accuracy of any given model portfolio $\mu$ is

$$\mathbb{E}[\text{Accuracy}(\mu)] = \mathbb{E}\left[\frac{\sum_{x \in X} \mathbb{1}\{f_{\mu(x)}(x) = O(x)\}}{|X|}\right] = \frac{\sum_{x \in X} \mathbb{E}[\mathbb{1}\{f_{\mu(x)}(x) = O(x)\}]}{|X|} = \frac{\sum_{x \in X} f_{\mu(x)}(x)[O(x)]}{|X|} \quad (1)$$

where the last equality follows from $\mathbb{E}[\mathbb{1}\{f_{\mu(x)}(x) = O(x)\}] = \Pr[f_{\mu(x)}(x) = O(x)] = f_{\mu(x)}(x)[O(x)]$. Note that $f_i(x)[O(x)]$ is the success probability that the classifier $f_i$ correctly predicts the ground truth label for query $x$. For brevity, we define $SP_i(x) := f_i(x)[O(x)]$ and rewrite the expected accuracy as

$$\mathbb{E}[\text{Accuracy}(\mu)] = \frac{\sum_{x \in X} SP_{\mu(x)}(x)}{|X|}.$$

The exact computation of the success probability is intractable since the ground truth of user queries is unknown a priori. We propose a novel data-driven approach to estimate it for any classifier and show that our estimator is unbiased and low-variance with asymptotic guarantees, for well-separated classification problems like image classification. Based on this, we develop a principled approach for estimating the expected accuracy of a given model portfolio.

Definition 4.1 (r-separation (Yang et al., 2020)). We say a metric space $(\mathcal{X}, dist)$, where $\mathcal{X} = \cup_{i \in [C]} \mathcal{X}^{(i)}$, is r-separated if there exists a constant $r > 0$ such that $dist(\mathcal{X}^{(i)}, \mathcal{X}^{(j)}) \geq r$ for all $i \neq j$, where $dist(\mathcal{X}^{(i)}, \mathcal{X}^{(j)}) = \min_{x \in \mathcal{X}^{(i)},\, x' \in \mathcal{X}^{(j)}} dist(x, x')$.

In words, in an r-separated metric space, there is a constant $r > 0$ such that the distance between instances from different classes is at least $r$. The key observation is that many real-world classification tasks comprise distinct classes. For instance, images of different categories (e.g., goldfish, bullfrog, etc.) are very unlikely to sharply change their classes under minor image modification. It has been widely observed (Yang et al., 2020) that the classification problem on real-world images empirically satisfies r-separation under standard metrics (e.g., the $l_\infty$ norm). We also observe similar patterns on a number of standard image datasets (e.g., Tiny ImageNet) and provide more empirical evidence in Appendix A.1. With this observation, we can show that the oracle classifier $O$ is Lipschitz continuous⁴ (Eriksson et al., 2004).

Definition 4.2 (Lipschitz Continuity). Given two metric spaces $(X, d_X)$ and $(Y, d_Y)$, where $d_X$ (resp. $d_Y$) is the metric on the set $X$ (resp. $Y$), a function $f: X \to Y$ is Lipschitz continuous if there exists a constant $L \geq 0$ s.t.

$$\forall x, x' \in X: \; d_Y(f(x), f(x')) \leq L \cdot d_X(x, x') \quad (2)$$

and the smallest $L$ satisfying Equation (2) is called the Lipschitz constant of $f$.

Lemma 4.3. There exists an oracle classifier $O$ which is Lipschitz continuous if the metric space associated with the instances $\mathcal{X}$ is r-separated.

If we further choose the classifiers $f_i$ to be Lipschitz continuous (e.g., MLP (Bartlett et al., 2017), ResNet (Gouk et al., 2021), Lipschitz continuous Transformer (Qi et al., 2023)), we can show that the success probability function $SP_i(x)$ (i.e., the likelihood that a classifier $f_i$ successfully predicts the ground truth label for query $x$) is also Lipschitz continuous.

Lemma 4.4. The success probability function $SP_i(x) = f_i(x)[O(x)]$ is Lipschitz continuous if $f_i(x)$ and $O(x)$ are Lipschitz continuous.

An important implication of Lemma 4.4 is that, given a classifier, we can estimate its success probability on query $x$ by its success probability on a similar query $x'$. Let $L_i > 0$ denote the Lipschitz constant for $SP_i$. For any $x, x' \in \mathcal{X}$, we have the estimation error bounded by $|SP_i(x') - SP_i(x)| \leq L_i \cdot dist(x, x')$, which monotonically decreases as $dist(x, x')$ approaches 0.⁵

⁴ The Lipschitz continuity for soft classifiers is defined w.r.t. the output distribution.
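To illustrate this bound, here is a minimal sketch (with illustrative names, not the paper's released code) of estimating $SP_i(x)$ by the success probability at the nearest labelled sample under the $l_\infty$ metric, together with the Lipschitz error bound $L_i \cdot dist(x, NN_S(x))$; the next paragraph formalizes this estimator and its guarantees.

```python
import numpy as np

def estimate_sp_nn(x, sample_feats, sample_sp, lipschitz_const=None):
    """Estimate SP_i(x) by SP_i(NN_S(x)) under the l_inf metric.

    x:            (d,)  feature vector of the query
    sample_feats: (s, d) features of the labelled sample S
    sample_sp:    (s,)  SP_i values pre-computed on the sample points
    Returns the estimate and, if a Lipschitz constant L_i is supplied,
    the bound L_i * dist(x, NN_S(x)) on the absolute estimation error.
    """
    dists = np.max(np.abs(sample_feats - x), axis=1)   # l_inf distance to every sample
    nn = int(np.argmin(dists))                          # index of the nearest neighbour
    estimate = float(sample_sp[nn])
    err_bound = None if lipschitz_const is None else lipschitz_const * float(dists[nn])
    return estimate, err_bound
```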
In practice, we can pre-compute a labelled sample $S \subseteq \mathcal{X}$ (e.g., pre-compute classifier outputs on sampled queries from the validation set) and compute $NN_S(x)$, the nearest neighbour of $x$ in $S$, for success probability estimation. We show that the estimator is asymptotically unbiased as the sample size increases.

Lemma 4.5 (Asymptotically Unbiased Estimator). Given a query $x$, a classifier $f_i$, and a uniformly sampled $S \subseteq \mathcal{X}$,

$$\lim_{s \to \infty} \mathbb{E}[SP_i(NN_S(x))] = SP_i(x) \quad (3)$$

where $s$ is the sample size and $NN_S(x) := \arg\min_{x' \in S} dist(x, x')$ is the nearest neighbour of $x$ in the sample $S$.

In practice, we draw $K$ i.i.d. samples, $S_1, S_2, \ldots, S_K$, and compute the average sample accuracy $\widehat{SP}_i(x) := \frac{1}{K} \sum_{k=1}^{K} SP_i(NN_{S_k}(x))$ as the estimator of the test accuracy on query $x$, for each classifier $f_i$. It follows from Lemma 4.5 that $\widehat{SP}_i$ is also an asymptotically unbiased estimator. We further show below that $\widehat{SP}_i$ is an asymptotically low-variance estimator of $SP_i$, as $K$ increases.

Lemma 4.6 (Asymptotically Low-Variance Estimator). Given a query $x$, a classifier $f_i$, and $K$ i.i.d. uniformly drawn samples $S_1, S_2, \ldots, S_K$ of size $s$, let $\sigma_i^2$ denote the variance of the estimator $\widehat{SP}_i(x)$. We have that $\sigma_i^2$ is asymptotically proportional to $\frac{1}{K}$ as both $s$ and $K$ increase.

4.2 COMPUTING $\mu^*$ WITH $\widehat{\mathrm{Accuracy}}(\mu)$

In the previous section, we showed how to estimate the accuracy of a given model portfolio. For each classifier $f_i$ and query $x$, we propose to estimate its success probability $SP_i(x)$ by $\widehat{SP}_i(x)$, which is based on similar queries from labelled samples and can be efficiently pre-computed. With the estimator in place, we formulate the problem of finding the optimal model portfolio as an integer linear programming (ILP) problem as follows. Given a set of $M$ classifiers $f_1, f_2, \ldots, f_M$, user queries $X = \{x_1, x_2, \ldots, x_N\}$, pre-computed samples $S_1, S_2, \ldots, S_K$, and budget $B \in \mathbb{R}^+$, we have the following ILP problem:⁶

$$\max_{t} \; \sum_{i=1}^{M} \sum_{j=1}^{N} \widehat{SP}_i(x_j) \, t_{i,j} \quad \text{s.t.} \quad \sum_{i=1}^{M} \sum_{j=1}^{N} b_i \, t_{i,j} \leq B \;\text{ and }\; \sum_{i=1}^{M} t_{i,j} = 1, \text{ for } j = 1, 2, \ldots, N \quad (4)$$

where $t_{i,j} \in \{0, 1\}$ are boolean variables and $t_{i,j} = 1$ iff the classifier $f_i$ is assigned to query $x_j$. Clearly, the optimal model portfolio $\mu^*$ can be efficiently computed as $\mu^*(x_j) = i$ iff $t^*_{i,j} = 1$, for $i \in [M]$ and $j \in [N]$, where $t^*_{i,j}$ is the optimal solution to the ILP problem above. While ILP problems are NP-hard in general, we can use standard ILP solvers (e.g., HiGHS (Huangfu & Hall, 2018)) to efficiently compute the optimal solution in practice.

The optimization problem aims to maximize the estimated model portfolio accuracy and is subject to the risk of overestimation due to selection bias, especially on large-scale problems. Intuitively, a poor classifier with high-variance estimates can be mistakenly assigned to some queries if its performance on those queries is overestimated. We address this by regularizing the accuracy estimate for each classifier by the corresponding estimator variance. Specifically, we optimize the objective $\sum_{i=1}^{M} \sum_{j=1}^{N} (\widehat{SP}_i(x_j) - \lambda \sigma_i) \, t_{i,j}$ in Equation (4), where $\sigma_i$ is the standard deviation of the estimator $\widehat{SP}_i$. As $\sigma_i$ is unknown a priori, we use a validation set to estimate $\sigma_i$ for each classifier $f_i$ and tune $\lambda$ for the highest validation accuracy.

The overall algorithm is shown in Algorithm 1. We first pre-compute the ML classifier accuracy for all samples (lines 1-3). Next, for each test query, we use its nearest neighbour from all samples to estimate the test accuracy of each classifier (lines 4-6). In line 7, we compute the optimal model portfolio with the accuracy estimation subject to cost budgets by solving the ILP problem (see Equation (4)).
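For concreteness, here is one way Equation (4) might be solved in practice. The sketch below uses SciPy's `milp` interface, which is backed by the HiGHS solver mentioned above; the paper's released code may use a different interface, and all function and variable names here are illustrative.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def optimal_portfolio(sp_hat, costs, budget):
    """Solve Equation (4). sp_hat is an (M, N) array of estimated success
    probabilities, costs is a length-M vector of per-query costs b_i, and
    budget is the total budget B. Returns a length-N array of model indices."""
    M, N = sp_hat.shape
    c = -sp_hat.reshape(-1)                       # milp minimizes, so negate the objective
    # Budget constraint: sum_{i,j} b_i * t_{i,j} <= B
    budget_row = np.repeat(costs, N).reshape(1, -1)
    budget_con = LinearConstraint(budget_row, -np.inf, budget)
    # Assignment constraints: sum_i t_{i,j} = 1 for every query j
    A_assign = np.zeros((N, M * N))
    for j in range(N):
        A_assign[j, j::N] = 1.0                   # picks out t_{i,j} for all i
    assign_con = LinearConstraint(A_assign, 1.0, 1.0)
    res = milp(c, constraints=[budget_con, assign_con],
               integrality=np.ones(M * N), bounds=Bounds(0, 1))
    assert res.x is not None, "infeasible budget"
    t = res.x.reshape(M, N)
    return t.argmax(axis=0)                       # index i with t_{i,j} = 1 for each query
```

To apply the variance regularization described above, one would simply pass the penalized scores $\widehat{SP}_i(x_j) - \lambda\sigma_i$ (e.g., `sp_hat - lam * sigma[:, None]`) in place of `sp_hat`.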
⁵ We evaluate the nearest neighbour distance and estimation error in Appendices A.2 and A.3.
⁶ Our problem can be rephrased as "selecting, for each query, one item (i.e., ML classifier) from a collection (the set of all classifiers) so as to maximize the total value (accuracy) while adhering to a predefined weight limit (cost budget)", which is a classic multiple-choice knapsack problem (MCKP) (Kellerer et al., 2004), and the ILP formulation is the natural choice.

Algorithm 1: OCCAM Algorithm.
Input: test query batch X; ML classifiers f_1, f_2, ..., f_M and costs c_1, c_2, ..., c_M; query samples S_1, S_2, ..., S_K; user cost budget B.
Output: optimal model portfolio µ*: X → [M].
// Pre-compute the ML classifier accuracy for all samples
1: for s in S_1 ∪ S_2 ∪ ... ∪ S_K do
2:     for i in 1, 2, ..., M do
3:         SampleAccuracy[s, i] = Accuracy(f_i(s), ground_truth(s))
// Use samples to estimate the accuracy for each test query
4: for x in X do
5:     for i in 1, 2, ..., M do
6:         AccuracyEstimation[x, i] = Mean({SampleAccuracy[NearestNeighbour(x, S_j), i], for 1 ≤ j ≤ K})
// Compute the optimal model portfolio with the accuracy estimation subject to the cost budget (see Equation (4))
7: µ* = ILPSolver(AccuracyEstimation, c_1, c_2, ..., c_M, B)

5 EVALUATION

5.1 EVALUATION SETUP

Task. We consider the image classification task, given its well-separation property (Yang et al., 2020): given an image, predict a class label from a set of predefined class categories. We assume that each image has a unique ground-truth class label.

Datasets. We consider 4 widely studied datasets for image classification: CIFAR-10 (10 classes) (Krizhevsky et al., 2009), CIFAR-100 (100 classes) (Krizhevsky et al., 2009), Tiny ImageNet (200 classes) (CS231n), and ImageNet-1K (1000 classes) (Russakovsky et al., 2015). Details of these datasets are in Appendix C.1.

Models. We consider a total of 7 classifiers: ResNet-[18, 34, 50, 101] (He et al., 2016)⁷ and SwinV2-[T, S, B] (Liu et al., 2022)⁸. Among these classifiers, ResNet-18 is the smallest (in terms of the number of model parameters and training/inference time) and thus has the least capacity, while SwinV2-B is the largest and has the highest accuracy in general (see Figure 1a). We take the classifiers pre-trained on the ImageNet-1K dataset (Russakovsky et al., 2015). We directly use the pre-trained models on ImageNet-1K, while on other datasets, we freeze everything but train only the last layer from scratch. Model training details are in Appendix C.2. All experiments are conducted with one NVIDIA V100 GPU with 32GB of GPU RAM. Code is available at https://github.com/DujianDing/OCCAM.git.

Inference Cost. The absolute costs of running a model may be expressed using a variety of metrics, including FLOPs, latency, dollars, etc. While FLOPs is an important metric that has the advantage of being hardware independent, it has been found not to correlate well with wall-clock latency, energy consumption, and dollar costs, which are of more practical interest to end users (Dao et al., 2022). In practice, dollar costs usually correlate highly with inference latency on GPUs. In our work, we define the cost of model inference in USD. We approximate the inference cost of computation by taking the cost per hour ($3.06) of the Azure Machine Learning (AML) NC6s v3 instance (Azure ML, 2024), as summarized in Table 1. The AML NC6s v3 instance contains a single V100 GPU and is commonly used for deep learning.
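The prices in Table 1 are consistent with simply converting measured latency at this hourly rate; for example, for ResNet-18 (88.9 s per 10,000 queries),

$$\frac{88.9\ \text{s}}{3600\ \text{s/hour}} \times \$3.06/\text{hour} \approx \$0.076,$$

and analogously $610.6 / 3600 \times \$3.06 \approx \$0.519$ for SwinV2-B, whose price defines a normalized cost of 1.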
Since CPU resources are significantly cheaper than GPU resources (e.g., the D2s v3 instance, equipped with 2 CPUs and no GPU, costs $0.096 per hour (Azure ML, 2024)) and all methods studied in this work typically finish in several CPU-seconds, incurring negligible expenses, we ignore the costs incurred by CPUs in our comparison. In addition, since larger models typically have higher accuracy as well as higher costs (see Figure 1a), a practically interesting setting is to study how to deliver high-quality answers with reduced costs in comparison to solely using the largest model (e.g., SwinV2-B). Normalized cost directly indicates the percentage cost saved and has been widely adopted in previous works (Ding et al., 2024; Kag et al., 2022), following which we report all results in terms of the normalized cost of each classifier.

⁷ Numbers in brackets indicate the model's layer number.
⁸ Letters in brackets indicate the Swin Transformer V2 size; T/S/B means tiny/small/base.

Table 1: Model costs on the image classification task. Latency and prices are measured for 10,000 queries. Normalized cost is the fraction of the price w.r.t. SwinV2-B.

| Models | Latency (s) | Price ($) | Normalized Cost |
| --- | --- | --- | --- |
| ResNet-18 | 88.9 | 0.076 | 0.15 |
| ResNet-34 | 135.9 | 0.116 | 0.22 |
| ResNet-50 | 174.5 | 0.148 | 0.29 |
| ResNet-101 | 317.4 | 0.270 | 0.52 |
| SwinV2-T | 326.4 | 0.277 | 0.53 |
| SwinV2-S | 600.7 | 0.511 | 0.98 |
| SwinV2-B | 610.6 | 0.519 | 1 |

ILP Solver. While our approach is agnostic to the choice of the ILP solver, we choose the high-performance ILP solver HiGHS (Huangfu & Hall, 2018) to solve the problem in Equation (4), given its well-demonstrated efficiency and effectiveness on public benchmarks (Gleixner et al., 2021). In a nutshell, HiGHS solves ILP problems with branch-and-cut algorithms (Fischetti & Monaci, 2020) and stops whenever the gap between the current solution and the global optimum is small enough (e.g., 1e-6).

Baselines. We compare our approach with three baselines: single best, random, and FrugalMCT (Chen et al., 2022). Single best always chooses the strongest (i.e., most expensive) model for a given cost budget. Random estimates classifier accuracy with random guesses (i.e., uniform samples from [0, 1]) and solves the problem in Equation (4) with the same ILP solver as ours. FrugalMCT (Chen et al., 2022) is a recent work which selects the best ML models for given user budgets in an online setting, using model-based accuracy estimation. Following the same setting as in (Chen et al., 2022), we train random forest regressors on top of model-extracted features (e.g., ResNet-18 features) as the accuracy predictor. The predicted accuracy is used in Equation (4), which is solved by the same ILP solver as ours to make a fair comparison in terms of solution quality and costs.

Our Method. We evaluate OCCAM (see Section 4) under various metrics (i.e., $l_1$, $l_2$, and $l_\infty$ norms) and cost budgets. We consider images represented by model-based embeddings. Specifically, we extract the image feature⁹ of the query image and all the validation images. The costs incurred by feature extraction are deducted from the user budget B before we compute the optimal model portfolio.
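As a concrete illustration of the embedding step, the sketch below extracts ResNet-18 features with torchvision by dropping the classification head. Note that footnote 9 describes the extractor as trained on the target dataset and the features as normalized; the pre-trained weights and the normalization scheme shown here are assumptions for illustration, not necessarily what the released code does.

```python
import torch
import torchvision

# ResNet-18 backbone with its classification head removed, so the forward pass
# returns the 512-dimensional penultimate-layer feature for each image.
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(images):
    """images: (B, 3, H, W) tensor preprocessed as the backbone expects.
    Returns a (B, 512) tensor of embeddings used for nearest-neighbour search."""
    feats = backbone(images)
    # Rescale each dimension to [0, 1] so that l_inf / l_2 / l_1 distances are
    # comparable across dimensions (Appendix A.2 notes the features are normalized;
    # the exact normalization scheme is an assumption here).
    mins = feats.min(dim=0).values
    maxs = feats.max(dim=0).values
    return (feats - mins) / (maxs - mins + 1e-8)
```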
We report the test accuracy under different cost budgets for OCCAM and all baselines in Section 5.2 (Figure 3 and Table 2), validate that OCCAM is cost-aware and is indeed selecting the most profitable ML models to deliver high-accuracy solutions in Section 5.3 (Figure 4a), demonstrate the effectiveness of OCCAM with limited samples and various λ values in Section 5.4 (Figures 4b and 4c), investigate the nearest neighbour distance in Appendix A.2, show that the estimation error of our accuracy estimator quickly decreases as the sample size increases in Appendix A.3, test the generalizability of OCCAM with different feature extractors in Appendix A.4, provide more performance results under different metrics (see Appendix A.5) and more classifiers (see Appendix A.6), show how to effectively choose K and s in Appendix A.7, perform overhead and upper-bound performance analysis of OCCAM in Appendices A.8 and A.9, and examine classifier complementarity in Appendix A.10. For simplicity, unless otherwise stated, we report OCCAM performance using ResNet-18 features and the $l_\infty$ metric with K = 40 for all datasets (s = 500 for CIFAR-10 and CIFAR-100, and s = 1000 for Tiny ImageNet and ImageNet-1K). We choose λ = 100 for ImageNet-1K and λ = 5 for all other datasets because ImageNet-1K contains a high variety of image classes (1000 classes), which leads to relatively high estimation errors and requires a stronger regularization penalty via large λ values.

⁹ The image feature is the last-layer output of an ML model (e.g., ResNet-18) trained on the target dataset, given an input image.

5.2 PERFORMANCE RESULTS

We investigate the test accuracy achieved by OCCAM and all baselines under different cost budgets and depict the results in Figure 3. We can see that by trading little to no accuracy drop, OCCAM achieves significant cost savings and outperforms all baselines across a majority of experiment settings.

Table 2: Cost reduction vs. accuracy drop by OCCAM and baselines. Cost reduction and accuracy drops are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries. For example, on Tiny ImageNet, using SwinV2-B to classify all 10,000 test images achieves an accuracy of 82.5% and incurs a total cost of $0.519 (we take it as the normalized cost 1, see Table 1). A 10% cost reduction equals a cost budget of $0.467 (i.e., a normalized cost of 0.9), under which we evaluate the achieved accuracy of OCCAM and all baselines and report the relative accuracy drops (%).

| Dataset | Cost Reduction (%) | Single Best | Rand | FrugalMCT | OCCAM |
| --- | --- | --- | --- | --- | --- |
| CIFAR-10 | 10 | 2.22 | 2.86 | 0.97 | 0.56 |
| CIFAR-10 | 20 | 2.22 | 2.86 | 1.13 | 0.50 |
| CIFAR-10 | 40 | 2.22 | 2.86 | 1.22 | 0.51 |
| CIFAR-100 | 10 | 3.18 | 3.29 | 0.52 | 0.34 |
| CIFAR-100 | 20 | 3.18 | 3.29 | 0.79 | 0.36 |
| CIFAR-100 | 40 | 3.18 | 3.29 | 1.98 | 0.62 |
| Tiny-ImageNet-200 | 10 | 4.01 | 7.03 | 0.86 | 0.17 |
| Tiny-ImageNet-200 | 20 | 4.01 | 7.03 | 1.49 | 0.61 |
| Tiny-ImageNet-200 | 40 | 4.01 | 7.03 | 3.88 | 2.75 |
| ImageNet-1K | 10 | 2.53 | 5.98 | 0.59 | 0.51 |
| ImageNet-1K | 20 | 2.53 | 5.98 | 1.12 | 1.05 |
| ImageNet-1K | 40 | 2.53 | 5.98 | 2.35 | 2.24 |

Figure 3: Accuracy-cost tradeoffs achieved by OCCAM and baselines on CIFAR-10, CIFAR-100, Tiny-ImageNet-200, and ImageNet-1K, for different cost budgets.
Results on cost reduction vs. accuracy drop for all approaches are summarized in Table 2. On the easy classification task (CIFAR-10, with 10 classes), OCCAM consistently outperforms all baselines by achieving 40% cost reduction with up to a 0.56% accuracy drop. Cost reduction and accuracy drop are computed w.r.t. using the strongest model (i.e., SwinV2-B) for all queries. On the moderate classification task (CIFAR-100, with 100 classes), OCCAM outperforms all baselines by trading up to a 0.62% accuracy drop for 40% cost reduction. On the hard classification task (Tiny ImageNet, with 200 classes), OCCAM significantly outperforms all three baselines with at least 0.5% higher accuracy. Notably, in aggressive cost regimes (e.g., 40% cost reduction), the achieved accuracy of OCCAM is 1.1% higher than FrugalMCT, 4.3% higher than random, and 1.3% higher than single best. On the most challenging classification task (ImageNet-1K, with 1000 classes), OCCAM still consistently outperforms all three baselines with higher accuracy at all cost budget levels. We believe that the above results demonstrate the generalized effectiveness of OCCAM in achieving non-trivial cost reduction for a small accuracy drop on classification tasks of different difficulty levels.

5.3 VALIDATION RESULTS

We validate that OCCAM functions as intended, that is, it does select small-yet-profitable classifiers when budgets are limited and gradually switches to large-but-accurate classifiers as cost budgets increase. In Figure 4a we plot the model usage for each classifier under different cost budgets on the Tiny ImageNet dataset. From the figure, it can be seen that when cost budgets are restricted, OCCAM mainly chooses ResNet-18 to resolve queries given its low cost and good accuracy (as seen in Table 1 and Figure 1a). As budgets increase, OCCAM gradually switches to SwinV2-S and SwinV2-B, given their predominantly high accuracy (82%, as seen in Figure 1a).

5.4 STABILITY ANALYSIS

OCCAM pre-computes K labelled samples of size s to estimate the test accuracy at inference time and computes the optimal model portfolio by solving Equation (4) with the regularized accuracy estimate $\widehat{SP}_i(x) - \lambda \sigma_i$ (see Section 4.2). We investigate OCCAM performance with different total sample sizes (K × s) by setting s = 1000 and varying K from 10 to 40 (see Figure 4b), and examine the stability of OCCAM w.r.t. the choice of λ (see Figure 4c). We report results on the Tiny ImageNet dataset.

Figure 4: (a) Model usage breakdown by OCCAM under different cost budgets, (b) OCCAM accuracy with different sample sizes, and (c) OCCAM performance with different λ choices, all on Tiny-ImageNet-200.

In Figure 4b, we plot the achieved accuracy of OCCAM under different total sample sizes (K × s) and normalized cost budgets (B). We also report FrugalMCT performance using a maximum of 40,000 sampled images to train its accuracy predictor. With budget B = 0.8 (i.e., 20% cost reduction), OCCAM achieves comparable performance to FrugalMCT with only 25% of the samples and continues to outperform FrugalMCT as the total sample size increases.
With budget B = 0.6 (i.e., 40% cost reduction), OCCAM outperforms FrugalMCT by 0.7% higher accuracy with only 25% of the samples and achieves up to 1.3% higher accuracy as the total sample size increases, which demonstrates the sustained effectiveness of OCCAM even with limited samples. In Figure 4c, we plot the achieved accuracy of OCCAM with different λ choices and cost budgets (B). Compared to FrugalMCT, OCCAM achieves up to 1.3% higher accuracy when the budget B = 0.6, and 0.8% higher accuracy when B = 0.8. Notably, OCCAM outperforms FrugalMCT under all settings even when λ = 0 and always achieves higher accuracy when λ > 0, which shows that OCCAM remains effective even when λ is not well tuned.

6 LIMITATIONS

OCCAM assumes well-separated tasks. OCCAM assumes that the classification task is well separated (see Section 4.1), meaning intuitively that instances (e.g., images) of the problem should not sharply change their class labels under minor modification. To adapt OCCAM to other classification tasks such as sentiment analysis (Medhat et al., 2014), the challenge is how to choose the most suitable numeric representation so that the separation property is preserved.

OCCAM assumes IID data. The validity of our theoretical guarantees relies on the assumption that the underlying data is independent and identically distributed (IID). In our evaluation, we uniformly sample training and validation splits from the same population to ensure the IID assumption. It is an interesting question how to adapt our approach to accommodate data drift or out-of-distribution data.

OCCAM optimizes the average accuracy. In sensitive applications such as healthcare or autonomous driving, optimizing the average accuracy may not be adequate. Instead, optimizing the minimum accuracy is more appropriate. We can achieve this by changing the ILP objective (Equation (4)) to maximize the minimum accuracy over all queries, that is, $\min_{1 \leq j \leq N} \sum_{i=1}^{M} \widehat{SP}_i(x_j) \, t_{i,j}$.

7 CONCLUSION

Motivated by the need to optimize the classifier assignment to different classification queries under pre-defined cost budgets, we have formulated the optimal model portfolio problem and proposed a principled approach, Optimization with Cost Constraints for Accuracy Maximization (OCCAM), to effectively deliver high-accuracy solutions. We present an unbiased and low-variance estimator for classifier test accuracy with asymptotic guarantees, and compute an optimal classifier assignment with novel regularization techniques mitigating overestimation risks. Our experimental results on a variety of real-world classification datasets show that we can achieve up to 40% cost reduction with no significant drop in classification accuracy.

ACKNOWLEDGMENTS

The authors would like to thank Yuxi Feng and Chenyang Tao for helpful discussions. This work is supported in part by the Institute for Computing, Information and Cognitive Systems (ICICS) at UBC. Research by the first and last authors was supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC), Grant Number RGPIN-2020-05408.

REFERENCES

Barsha Abhisheka, Saroj Kumar Biswas, Biswajit Purkayastha, Dolly Das, and Alexandre Escargueil. Recent trend in medical imaging modalities and their applications in disease diagnosis: a review. Multimedia Tools and Applications, 83(14):43035-43070, 2024.

Yara Alghofaili, Albatul Albattah, and Murad A Rassam. A financial fraud detection model based on LSTM deep learning technique.
Journal of Applied Security Research, 15(4):498 516, 2020. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425 2433, 2015. Azure ML. Azure machine learning pricing, February 2024. URL https://azure.microsoft. com/en-ca/pricing/details/machine-learning/. Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems, 30, 2017. Jinyuan Chang, Xiaohui Chen, and Mingcong Wu. Central limit theorems for high dimensional dependent data. Bernoulli, 30(1):712 742, 2024. Lingjiao Chen, Matei Zaharia, and James Y Zou. Frugalml: How to use ml prediction apis more accurately and cheaply. Advances in neural information processing systems, 33:10685 10696, 2020. Lingjiao Chen, Matei Zaharia, and James Zou. Efficient online ml api selection for multi-label classification tasks. In International Conference on Machine Learning, pp. 3716 3746. PMLR, 2022. Stanford CS231n. Tiny imagenet dataset. URL http://cs231n.stanford.edu/ tiny-imagenet-200.zip. Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R e. Flashattention: Fast and memoryefficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344 16359, 2022. Dujian Ding, Ganesh Jawahar, and Laks VS Lakshmanan. Pass: Pruning attention heads with almost-sure sparsity targets. Transactions on Machine Learning Research. Dujian Ding, Sihem Amer-Yahia, and Laks Lakshmanan. On efficient approximate queries over machine learning models. Proceedings of the VLDB Endowment, 16(4):918 931, 2022. Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor R uhle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=02f3m Utqn M. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020. Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. The Journal of Machine Learning Research, 20(1):1997 2017, 2019. Published as a conference paper at ICLR 2025 Kenneth Eriksson, Donald Estep, Claes Johnson, Kenneth Eriksson, Donald Estep, and Claes Johnson. Lipschitz continuity. Applied Mathematics: Body and Soul: Volume 1: Derivatives and Geometry in IR 3, pp. 149 164, 2004. William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1 39, 2022. Matteo Fischetti and Michele Monaci. A branch-and-cut algorithm for mixed-integer bilinear programming. European Journal of Operational Research, 282(2):506 514, 2020. Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp M. Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, Marco L ubbecke, Hans D. Mittelmann, Derya Ozyurt, Ted K. Ralphs, Domenico Salvagnin, and Yuji Shinano. MIPLIB 2017: Data-Driven Compilation of the 6th Mixed-Integer Programming Library. 
Mathematical Programming Computation, 2021. doi: 10.1007/s12532-020-00194-3. URL https://doi.org/10.1007/s12532-020-00194-3. Google. Cloud vision pricing. https://cloud.google.com/vision/pricing. Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael J Cree. Regularisation of neural networks by enforcing lipschitz continuity. Machine Learning, 110:393 416, 2021. Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pp. 293 299. IEEE, 1993. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531, 2(7), 2015. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700 4708, 2017. Qi Huangfu and JA Julian Hall. Parallelizing the dual revised simplex method. Mathematical Programming Computation, 10(1):119 142, 2018. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704 2713, 2018. Anil Kag, Igor Fedorov, Aditya Gangrade, Paul Whatmough, and Venkatesh Saligrama. Efficient edge inference by selective query. In The Eleventh International Conference on Learning Representations, 2022. Hans Kellerer, Ulrich Pferschy, David Pisinger, Hans Kellerer, Ulrich Pferschy, and David Pisinger. The multiple-choice knapsack problem. Knapsack Problems, pp. 317 347, 2004. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann Le Cun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http: //arxiv.org/abs/1412.6980. Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. Text classification algorithms: A survey. Information, 10(4):150, 2019. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012. Published as a conference paper at ICLR 2025 Yann Le Cun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012 10022, 2021. Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12009 12019, 2022. Walaa Medhat, Ahmed Hassan, and Hoda Korashy. Sentiment analysis algorithms and applications: A survey. 
Ain Shams engineering journal, 5(4):1093 1113, 2014. Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics, 19(6):1236 1246, 2018. Xianbiao Qi, Jianan Wang, Yihao Chen, Yukai Shi, and Lei Zhang. Lipsformer: Introducing lipschitz continuity to vision transformers. ar Xiv preprint ar Xiv:2304.09856, 2023. John R Rice. The algorithm selection problem. In Advances in computers, volume 15, pp. 65 118. Elsevier, 1976. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211 252, 2015. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ar Xiv preprint ar Xiv:1701.06538, 2017. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ar Xiv preprint ar Xiv:1409.1556, 2014. Ankita Singh and Pawan Singh. Image classification: a survey. Journal of Informatics Electrical and Electronics Engineering (JIEEE), 1(2):1 9, 2020. Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? a comprehensive study on the robustness of 18 deep image classification models. In Proceedings of the European conference on computer vision (ECCV), pp. 631 648, 2018. Mou Sun, Tao Li, and Wotao Yin. Mindopt adapter for cplex benchmarking performance analysis. ar Xiv preprint ar Xiv:2312.13527, 2023. Jigang Tang, Songbin Li, and Peng Liu. A review of lane detection methods based on deep learning. Pattern Recognition, 111:107623, 2021. Gregor Urban, Krzysztof J Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, and Matt Richardson. Do deep convolutional nets really need to be deep and convolutional? ar Xiv preprint ar Xiv:1603.05691, 2016. Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. 2011. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156 3164, 2015. Rongxia Wang, Malik Bader Alazzam, Fawaz Alassery, Ahmed Almulihi, and Marvin White. Innovative research of trajectory prediction algorithm based on deep learning in car network collision detection and early warning system. Mobile information systems, 2021:1 8, 2021. Published as a conference paper at ICLR 2025 Yao-Yuan Yang, Cyrus Rashtchian, Hongyang Zhang, Russ R Salakhutdinov, and Kamalika Chaudhuri. A closer look at accuracy vs. robustness. Advances in neural information processing systems, 33:8588 8601, 2020. Wan Zhu, Longxiang Xie, Jianye Han, and Xiangqian Guo. The application of deep learning in cancer prognosis prediction. Cancers, 12(3):603, 2020. Yilun Zhu, Jianxin Zhang, Aditya Gangrade, and Clayton Scott. Label noise: Ignorance is bliss. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=f TKcqr4xu X. Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. ar Xiv preprint ar Xiv:1611.01578, 2016. Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye. 
Object detection in 20 years: A survey. Proceedings of the IEEE, 111(3):257-276, 2023.

Figure 5: Intuitive example of well-separated images from Tiny ImageNet under the $l_\infty$ distance metric using ResNet-18 features (pairwise distances between the example images range from 0.47 to 0.99). Images from different classes (e.g., "goldfish" and "bullfrog") are typically well separated by a non-zero distance.

A ADDITIONAL EXPERIMENTS

A.1 REAL IMAGE DATASETS ARE WELL SEPARATED.

In (Yang et al., 2020), the authors have shown that many real image classification tasks comprise separated classes in RGB-valued space. In this section, we provide further empirical evidence to show that real image datasets (e.g., Tiny ImageNet) are well separated (see Definition 4.1) in different feature spaces under various metrics (Figures 5 and 6). In Figure 5, we provide an intuitive example to illustrate that images from different classes (e.g., "goldfish" and "bullfrog") are typically well separated by a non-zero distance. In Figure 6, we investigate the distance distribution for images of different classes from Tiny ImageNet (200 classes). We observe that images of different classes are typically far from each other by a non-zero distance under different metrics (e.g., $l_1$, $l_2$, and $l_\infty$) in different feature spaces (e.g., image features extracted by ResNet-18, ResNet-50, and SwinV2-T).

In addition, we note that real image datasets are subject to little to no label noise (Zhu et al., 2024). For example, on Tiny ImageNet, we investigate 40,000 images from the training split and only find 4 duplicate images with different class labels. We also consider more standard image datasets (see Section 5.1). It turns out that CIFAR-10 contains no label noise, CIFAR-100 contains 3 duplicate images with different class labels (out of 20,000 images), and the noise frequency on ImageNet-1K is 8 out of 40,000 images. Our observation suggests that standard image datasets are quite clean (aligned with the observation in (Yang et al., 2020)), which justifies the adoption of the well-separation assumption.

A.2 NEAREST NEIGHBOUR DISTANCE APPROACHES 0 AS SAMPLE SIZE INCREASES.

In this section, we conduct experiments to investigate how the nearest neighbour distance $dist(x, NN_S(x))$ changes as the sample size $s$ increases. We report results using different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) as well as different metrics ($l_1$, $l_2$, and $l_\infty$) on the validation split of the Tiny ImageNet dataset (Figure 7). It can be clearly seen in Figure 7 that the distance to the sampled nearest neighbour quickly approaches 0 as the sample size increases. This could be attributable to the fact that we are sampling from real images. With properly pre-trained feature extractors, the possible image embeddings could be restricted to a subspace rather than pervade the whole high-dimensional space, which can significantly reduce the required number of samples and give us meaningfully small distances to the sampled nearest neighbours. Another interesting observation is that, in all investigated feature spaces, $l_\infty$ always provides the smallest nearest neighbour distance at all sample sizes, followed by $l_2$ and $l_1$. Such a distinction mainly results from the fact that we use normalized image features where each dimension of the feature vector $x$ is between 0 and 1, that is, $0 \leq x[i] \leq 1$ for every dimension $i$.
Consequently, we have the inequality

$$l_\infty(x) = \max_i |x[i]| \;\leq\; l_2(x) = \sqrt{\textstyle\sum_i |x[i]|^2} \;\leq\; l_1(x) = \textstyle\sum_i |x[i]|.$$

Recall that OCCAM employs a classifier accuracy estimator which is asymptotically unbiased as the nearest neighbour distance approaches 0. The above observation suggests that $l_\infty$ is likely to provide smaller nearest neighbour distances and reduce the estimation error, which leads to higher overall performance, especially in scenarios where sampling is expensive or labelled data is scarce.

Figure 6: Distance distribution between images of different classes from Tiny ImageNet. We consider image representations derived from different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) as well as different metrics ($l_1$, $l_2$, and $l_\infty$). Images of different classes are typically far from each other by non-zero distances.

Figure 7: Nearest neighbour distance quickly approaches 0 as the sample size increases, using different image feature extractors (ResNet-18, ResNet-50, and SwinV2-T) and metrics ($l_1$, $l_2$, and $l_\infty$).

Figure 8: Estimation error for each ML classifier quickly decreases as the sample size increases, using different image feature extractors (ResNet-18, ResNet-50, and SwinV2-T) under the $l_\infty$ metric.

Table 3: Cost reduction vs. accuracy drop by baselines and OCCAM using different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) and the $l_\infty$ distance metric, on Tiny-ImageNet-200. Cost reduction and accuracy drops are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries; entries are accuracy drops (%).

| Cost Reduction (%) | Single Best | Rand | FrugalMCT (ResNet-18) | FrugalMCT (ResNet-50) | FrugalMCT (SwinV2-T) | OCCAM (ResNet-18) | OCCAM (ResNet-50) | OCCAM (SwinV2-T) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 4.01 | 7.03 | 0.86 | 0.84 | 1.18 | 0.48 | 0.40 | 0.29 |
| 20 | 4.01 | 7.03 | 1.49 | 1.45 | 1.60 | 1.02 | 0.74 | 0.58 |
| 40 | 4.01 | 7.03 | 3.88 | 4.12 | 3.22 | 3.24 | 2.56 | 2.81 |

A.3 ESTIMATION ERROR DECREASES AS SAMPLE SIZE INCREASES.

In this section, we investigate the estimation error (the difference between the real classifier accuracy and our estimator's output) for different ML classifiers, using different feature extractors (ResNet-18, ResNet-50, and SwinV2-T).
A.4 GENERALIZING TO DIFFERENT FEATURE EXTRACTORS

We further report the performance of OCCAM with different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) on Tiny ImageNet. As illustrated in Section 5.1, the costs incurred by feature extraction are deducted from the user budget before we compute the optimal model portfolio. Results are summarized in Table 3. OCCAM outperforms all baselines in all experimental settings, which demonstrates the effectiveness and generalizability of OCCAM with different feature extractors.

Table 3: Accuracy drop (%) at different cost reduction levels for baselines and OCCAM using different feature extractors (ResNet-18, ResNet-50, and SwinV2-T) and the ℓ∞ distance metric on Tiny-ImageNet-200. Cost reduction and accuracy drop are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries.

Cost Reduction (%) | Single Best | Rand | FrugalMCT (ResNet-18) | FrugalMCT (ResNet-50) | FrugalMCT (SwinV2-T) | OCCAM (ResNet-18) | OCCAM (ResNet-50) | OCCAM (SwinV2-T)
10 | 4.01 | 7.03 | 0.86 | 0.84 | 1.18 | 0.48 | 0.40 | 0.29
20 | 4.01 | 7.03 | 1.49 | 1.45 | 1.60 | 1.02 | 0.74 | 0.58
40 | 4.01 | 7.03 | 3.88 | 4.12 | 3.22 | 3.24 | 2.56 | 2.81

A.5 MORE OCCAM PERFORMANCE RESULTS.

In this section, we provide more OCCAM performance results using the ℓ1 and ℓ2 metrics, as shown in Figures 9 and 10. Comparison results are summarized in Tables 4 and 5 and are consistent with our analysis in Section 5.2. Typically, by trading little to no performance drop, OCCAM achieves significant cost reduction and outperforms all baselines across a majority of experimental settings. However, we also note that FrugalMCT can sometimes outperform OCCAM on ImageNet-1K under the ℓ1 and ℓ2 metrics, while OCCAM outperforms FrugalMCT across all experimental settings under the ℓ∞ metric (see Section 5.2). This can be explained by the fact that the ℓ1 and ℓ2 metrics tend to yield larger nearest neighbour distances than the ℓ∞ metric (see Appendix A.2), which implicitly increases OCCAM's estimation error and reduces its overall performance, especially when the classification task is challenging and labelled data is scarce. Given this, in practice we recommend applying OCCAM with the ℓ∞ metric to achieve significant cost reduction with little to no performance drop (see Section 5.2).

Table 4: Accuracy drop (%) at different cost reduction levels for OCCAM and baselines using ResNet-18 features and the ℓ1 distance metric. Cost reduction and accuracy drop are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries.

Cost Reduction (%) | CIFAR-10: Single Best | Rand | FrugalMCT | OCCAM | CIFAR-100: Single Best | Rand | FrugalMCT | OCCAM
10 | 2.22 | 2.86 | 0.97 | 0.38 | 3.18 | 3.29 | 0.52 | 0.50
20 | 2.22 | 2.86 | 1.13 | 0.38 | 3.18 | 3.29 | 0.79 | 0.50
40 | 2.22 | 2.86 | 1.22 | 0.37 | 3.18 | 3.29 | 1.98 | 0.99

Cost Reduction (%) | Tiny-ImageNet-200: Single Best | Rand | FrugalMCT | OCCAM | ImageNet-1K: Single Best | Rand | FrugalMCT | OCCAM
10 | 4.01 | 7.03 | 0.86 | 0.48 | 2.53 | 5.98 | 0.59 | 0.86
20 | 4.01 | 7.03 | 1.49 | 1.02 | 2.53 | 5.98 | 1.12 | 1.51
40 | 4.01 | 7.03 | 3.88 | 3.24 | 2.53 | 5.98 | 2.35 | 3.32

Figure 9: Accuracy-cost trade-offs achieved by OCCAM and baselines using the ℓ1 metric and ResNet-18 features, for different cost budgets.

Table 5: Accuracy drop (%) at different cost reduction levels for OCCAM and baselines using ResNet-18 features and the ℓ2 distance metric. Cost reduction and accuracy drop are computed w.r.t. using the single largest model (i.e., SwinV2-B) for all queries.

Cost Reduction (%) | CIFAR-10: Single Best | Rand | FrugalMCT | OCCAM | CIFAR-100: Single Best | Rand | FrugalMCT | OCCAM
10 | 2.22 | 2.86 | 0.97 | 0.24 | 3.18 | 3.29 | 0.52 | 0.34
20 | 2.22 | 2.86 | 1.13 | 0.25 | 3.18 | 3.29 | 0.79 | 0.40
40 | 2.22 | 2.86 | 1.22 | 0.27 | 3.18 | 3.29 | 1.98 | 0.71

Cost Reduction (%) | Tiny-ImageNet-200: Single Best | Rand | FrugalMCT | OCCAM | ImageNet-1K: Single Best | Rand | FrugalMCT | OCCAM
10 | 4.01 | 7.03 | 0.86 | 0.21 | 2.53 | 5.98 | 0.59 | 1.06
20 | 4.01 | 7.03 | 1.49 | 0.81 | 2.53 | 5.98 | 1.12 | 1.65
40 | 4.01 | 7.03 | 3.88 | 2.75 | 2.53 | 5.98 | 2.35 | 3.10

Figure 10: Accuracy-cost trade-offs achieved by OCCAM and baselines using the ℓ2 metric and ResNet-18 features, for different cost budgets.

A.6 LARGE-SCALE EXPERIMENTS WITH 101 PRE-TRAINED CLASSIFIERS.
In this section, we investigate the capacity of OCCAM on the ImageNet-1K dataset with 101 pre-trained image classification models (e.g., DenseNet (Huang et al., 2017), ResNet (He et al., 2016), VGG (Simonyan & Zisserman, 2014), ViT (Dosovitskiy et al., 2020), and Swin Transformer V2 (Liu et al., 2022)) that are available on PyTorch Hub^10. We plot the validation accuracy and costs of the 101 pre-trained models in Figure 11. It can be clearly seen that a majority of the pre-trained models are Pareto-dominated by others, that is, there exist models which are more accurate yet cheaper. The models that are not Pareto-dominated by any other model are termed Pareto-efficient models. In practice, we can always replace a Pareto-dominated model with a corresponding Pareto-efficient model to improve accuracy and reduce costs simultaneously, so the Pareto-efficient models serve as the core set of models to choose from. Specifically, there are 12 Pareto-efficient models among the 101 pre-trained models, as summarized in Table 6.

^10 https://pytorch.org/vision/main/models.html

Figure 11: Accuracy vs. cost for the 101 pre-trained classifiers on ImageNet-1K from PyTorch Hub; the Pareto-efficient models are highlighted.

In our evaluation, we examine OCCAM with the 12 Pareto-efficient models and compare it with baselines that have access to all 101 pre-trained models. Note that FrugalMCT does not finish within 5 hours, so we exclude it from the comparison. Results are shown in Figure 12 and Table 7. OCCAM consistently outperforms all baselines, achieving less than 1% accuracy drop at various cost reduction rates, while the accuracy drop for Random is up to 7.05%, and up to 15.09% for Single Best. It is worth noting that Single Best, as defined in Section 5.1, always chooses the most expensive model affordable under a given cost budget. As shown in Figure 11, in practice, more expensive models are not necessarily more accurate. The performance of Single Best oscillates heavily due to the existence of expensive yet inaccurate models, which suggests that blindly using hundreds of classifiers in practice may not help the overall performance.

Table 6: Accuracy and costs of the 12 Pareto-efficient models on the image classification task. Latency and prices are measured for 10,000 queries. Normalized cost is the fraction of the price w.r.t. RegNetY_128GF (the most expensive model).

Models | Accuracy (%) | Latency (s) | Price ($) | Normalized Cost
ResNet-18 | 67.1 | 75.5 | 0.064 | 0.14
VGG19 | 70.6 | 82.7 | 0.070 | 0.15
VGG19_BN | 73.0 | 86.8 | 0.074 | 0.16
MNASNet1_3 | 75.7 | 113.0 | 0.096 | 0.21
RegNetX_800MF | 76.7 | 140.6 | 0.120 | 0.26
ViT_B_16 | 81.6 | 149.6 | 0.127 | 0.27
ConvNeXt_Tiny | 82.0 | 184.6 | 0.157 | 0.34
RegNetX_16GF | 82.2 | 255.1 | 0.217 | 0.46
RegNetX_32GF | 82.3 | 259.3 | 0.220 | 0.47
ViT_L_16 | 84.7 | 263.2 | 0.224 | 0.48
ViT_H_14 | 85.5 | 414.1 | 0.352 | 0.75
RegNetY_128GF | 86.0 | 549.7 | 0.467 | 1.00

Figure 12: Accuracy-cost trade-offs by OCCAM and baselines using the ℓ∞ metric and ResNet-18 features with the 101 pre-trained classifiers on ImageNet-1K.

Table 7: Accuracy drop (%) at different cost reduction levels for baselines and OCCAM using the ℓ∞ metric and ResNet-18 features with the 101 pre-trained classifiers on ImageNet-1K. Cost reduction and accuracy drop are computed w.r.t. using the most expensive model (i.e., RegNetY_128GF) for all queries. FrugalMCT does not finish within 5 hours and is excluded from the comparison.

Cost Reduction (%) | Single Best | Rand | FrugalMCT | OCCAM
10 | 15.09 | 7.05 | N/A | 0.17
20 | 3.27 | 7.05 | N/A | 0.37
40 | 9.15 | 7.05 | N/A | 0.80
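For reference, the Pareto-efficient subset plotted in Figure 11 and listed in Table 6 can be obtained from (cost, accuracy) pairs with a simple dominance filter. The sketch below is our own illustration (the helper name and tuple layout are assumptions), shown on a few rows of Table 6, all of which are already Pareto-efficient and are therefore kept.

```python
def pareto_efficient(models):
    """Keep models not dominated by another model that is at least as cheap and
    at least as accurate (and strictly better in one of the two)."""
    efficient = []
    for name, cost, acc in models:
        dominated = any(
            (c <= cost and a > acc) or (c < cost and a >= acc)
            for _, c, a in models
        )
        if not dominated:
            efficient.append((name, cost, acc))
    return efficient


# A few (normalized cost, accuracy %) rows from Table 6.
models = [
    ("ResNet-18", 0.14, 67.1),
    ("VGG19", 0.15, 70.6),
    ("ViT_L_16", 0.48, 84.7),
    ("RegNetY_128GF", 1.00, 86.0),
]
print(pareto_efficient(models))  # all four are Pareto-efficient
```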
A.7 EFFECTS OF K AND s ON OCCAM PERFORMANCE.

In this section, we examine the effects of K and s on the performance of OCCAM (see Figure 13). As shown by Lemmas 4.5 and 4.6, our estimator is asymptotically unbiased and low-variance as both K (the number of samples) and s (the sample size) increase, which leads to better overall performance. We demonstrate this by increasing K from 10 to 40 while keeping s fixed (1,000), and increasing s from 400 to 1,000 while keeping K fixed (40), as illustrated in Figures 13a and 13b. We observe that the performance of OCCAM consistently dominates that of FrugalMCT and improves as both K and s increase. In our evaluation, we pre-compute a held-out dataset (e.g., for Tiny ImageNet, we uniformly sample 40,000 images from the training split) from which we draw the K samples of size s. With the maximal number of samples bounded (for Tiny ImageNet, K·s ≤ 40,000), there is a trade-off between increasing K and having a larger s. As shown in Figure 13c, although a larger K typically leads to better accuracy, it also limits the maximal value that s can take (s ≤ 40,000/K), which results in sub-optimal performance. We empirically choose the K and s that give us the best overall accuracy across different datasets, as discussed in Section 5.1.

Figure 13: OCCAM performance improves as (a) the number of samples K and (b) the sample size s increase (panels (a) and (b) compare OCCAM and FrugalMCT at budgets B = 0.6 and B = 0.8 on Tiny-ImageNet-200). (c) In practice, when the total sample size is fixed (e.g., 40,000 for Tiny ImageNet), there is a trade-off between increasing K and having a larger s.

A.8 OVERHEAD ANALYSIS

In this section, we investigate the overheads incurred by OCCAM, which mainly result from nearest neighbour search and the use of the ILP solver, on the Tiny ImageNet dataset. With 10,000 test images and 40,000 total samples, the nearest neighbour search takes 8.68 seconds to return all results, up to two orders of magnitude less than the model inference time (see Table 1), which leads to negligible overheads. Such efficiency can be attributed to the linear time complexity of nearest neighbour search w.r.t. the total sample size. Specifically, let N denote the number of test images, s the sample size, K the number of samples, and d the image representation dimension; then the time complexity of nearest neighbour search is O(N·s·K·d).

In our evaluation, we adopt HiGHS (Huangfu & Hall, 2018) as our ILP solver, given its well-demonstrated efficiency and effectiveness on public benchmarks (Gleixner et al., 2021). With 10,000 test images and 7 classifiers (equivalently, an ILP instance with 70,000 variables and constraints; see Equation (4)), the HiGHS ILP solver takes 15.1 seconds to return the optimal assignment. The latency overhead of the ILP solver is only a fraction of the cost of using even the smallest model (ResNet-18 takes 88.9s to process 10,000 test images; see Table 1), and less than 2.5% of the cost of always using the largest model (SwinV2-B takes 610.6s to process 10,000 test images; see Table 1), which demonstrates that the overhead incurred by the ILP solver is negligible. We note that there are other ILP solvers which are even more efficient than HiGHS. For example, Gurobi^11 is up to 20X faster than HiGHS on large-scale MILP instances with up to 640K variables and constraints (Sun et al., 2023) and can be used as the ILP solver if the problem scale increases further.

^11 https://www.gurobi.com/
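The assignment ILP of Equation (4) can be prototyped with off-the-shelf tooling. The sketch below is our own illustration, not the paper's implementation: it uses scipy.optimize.milp (which calls HiGHS under the hood), and the function and variable names are assumptions. Given per-query accuracy estimates and per-classifier costs, it returns a one-hot assignment of classifiers to queries.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

def optimal_portfolio(acc_est, costs, budget):
    """Maximize total estimated accuracy subject to (i) exactly one classifier
    per query and (ii) total cost <= budget.
    acc_est: (N, M) estimated accuracy of classifier j on query i.
    costs:   (M,)  per-query inference cost of each classifier."""
    N, M = acc_est.shape
    c = -acc_est.ravel()                       # milp minimizes, so negate
    one_per_query = LinearConstraint(np.kron(np.eye(N), np.ones(M)), lb=1, ub=1)
    within_budget = LinearConstraint(np.tile(costs, N)[None, :], lb=0, ub=budget)
    res = milp(c=c, constraints=[one_per_query, within_budget],
               integrality=np.ones(N * M), bounds=Bounds(0, 1))
    return res.x.reshape(N, M).round().astype(int)

# Toy example: 4 queries, a cheap classifier and an expensive one.
acc = np.array([[0.6, 0.9], [0.8, 0.9], [0.9, 0.95], [0.5, 0.9]])
print(optimal_portfolio(acc, costs=np.array([1.0, 4.0]), budget=10.0))
```

The dense constraint matrices keep the sketch short; a production implementation would use sparse matrices for instances with tens of thousands of variables like those discussed above.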
Overall, the latency overheads incurred by OCCAM are negligible. Since OCCAM only requires CPU support to compute the optimal model assignment, the monetary overheads are very small as well.

A.9 UPPER BOUND PERFORMANCE OF OCCAM.

In this section, we investigate the optimal performance that can be reached with the true classification likelihood (i.e., no estimation error) on the Tiny ImageNet dataset (see Figure 14). Note that the underlying method of OCCAM and of Optimal is exactly the same. The performance difference between OCCAM and Optimal mainly comes from the estimation error and hints at substantial room for further improvement, which we will explore in future work.

Figure 14: Accuracy-cost trade-offs by OCCAM and baselines using the ℓ∞ metric and ResNet-18 features on Tiny-ImageNet-200. Optimal is computed using the true classification likelihood instead of the estimate.

A.10 COMPLEMENTARITY BETWEEN DIFFERENT CLASSIFIERS.

In this section, we examine the complementary behaviour of different classifiers, that is, different classifiers may be best suited to different subsets of test queries. On the Tiny ImageNet dataset, we plot the frequency with which one classifier makes the right prediction while another fails (see Figure 15). For example, on 5% of test queries, ResNet-18 correctly classifies images on which SwinV2-B fails, while on another 24% of test queries, SwinV2-B makes the right prediction where ResNet-18 fails. This complementarity implies the possibility of a model assignment strategy that outperforms the single best classifier, such as the Optimal approach shown in Figure 14: Optimal significantly outperforms the Single Best baseline, suggesting substantial room for further improvement by reducing the accuracy estimation error.

Figure 15: Classifier complementarity. Each entry indicates the percentage of queries on which the classifier on the row makes the right prediction while the classifier on the column fails.
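The matrix in Figure 15 can be computed directly from per-query correctness indicators. The following is a minimal sketch of our own (the function name and input layout are assumptions), not the paper's code.

```python
import numpy as np

def complementarity_matrix(correct):
    """correct: (num_classifiers, num_queries) boolean array, True where a
    classifier predicts the right label. Entry (r, c) of the result is the
    fraction of queries where classifier r is right and classifier c is wrong."""
    correct = np.asarray(correct, dtype=bool)
    return (correct[:, None, :] & ~correct[None, :, :]).mean(axis=-1)
```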
B PROOFS

In this section, we provide proofs of Lemmas 4.3 to 4.6.

Proof of Lemma 4.3

Proof. The proof is straightforward. Without loss of generality, we consider the ℓ1 metric and assume $(X, \ell_1)$ is an $r$-separated metric space. For brevity, we abuse notation and let $O(x)$ denote the one-hot output distribution over all labels. For any $x, x' \in X$, if $x$ and $x'$ belong to the same class, then $\|O(x) - O(x')\|_1 = 0 \le \frac{2}{r}\|x - x'\|_1$; otherwise, $\|O(x) - O(x')\|_1 = 2 \le \frac{2}{r}\|x - x'\|_1$, since $\|x - x'\|_1 \ge r$ by $r$-separation. The Lipschitz constant for $O$ is therefore $\frac{2}{r}$.

Proof of Lemma 4.4

Proof. Similarly, without loss of generality, we consider the ℓ1 metric and let $f_i(x)$ and $O(x)$ denote the output distributions over all labels. Let $L_i$ and $L_O$ denote the Lipschitz constants for $f_i$ and $O$, respectively. For any $x, x' \in X$, if $x$ and $x'$ belong to the same class, then $|SP_i(x) - SP_i(x')| = |f_i(x)[O(x)] - f_i(x')[O(x)]| \le \|f_i(x) - f_i(x')\|_1 \le L_i\,\|x - x'\|_1$; otherwise, $|SP_i(x) - SP_i(x')| = |f_i(x)[O(x)] - f_i(x')[O(x')]| \le 1 = \frac{1}{2}\|O(x) - O(x')\|_1 \le \frac{L_O}{2}\|x - x'\|_1$. The Lipschitz constant for $SP_i$ is therefore $\max\{L_i, \frac{L_O}{2}\}$.

Proof of Lemma 4.5

Proof. The proof leverages the fact that, as the sample size increases, the expected distance between $x$ and its nearest neighbour monotonically decreases. Letting $L_i$ denote the Lipschitz constant of $SP_i$, the estimation error satisfies $|\mathbb{E}[SP_i(NN_S(x))] - SP_i(x)| \le \mathbb{E}[|SP_i(NN_S(x)) - SP_i(x)|] \le L_i\,\mathbb{E}[\mathrm{dist}(NN_S(x), x)]$, which approaches 0 as $\mathbb{E}[\mathrm{dist}(NN_S(x), x)]$ decreases.

Proof of Lemma 4.6

Proof. Lemma 4.5 shows that each $SP_i(NN_{S_k}(x))$ is an unbiased estimator of $SP_i(x)$, $1 \le k \le K$, as $s$ approaches infinity. Let $\sigma_i^2$ denote the variance of $SP_i(NN_{S_k}(x))$ for each $k$. By the Central Limit Theorem, the distribution of the estimator $\frac{1}{K}\sum_{k=1}^{K} SP_i(NN_{S_k}(x))$ approaches a normal distribution with variance $\frac{\sigma_i^2}{K}$ (Chang et al., 2024).

C EXPERIMENT DETAILS

C.1 DATASETS

CIFAR-10^12. CIFAR-10 (Krizhevsky et al., 2009) contains 60,000 images of resolution 32×32, evenly divided into 10 classes, where 50,000 images are for training and 10,000 images are for testing. We randomly sample 20,000 images from the training set as our validation set, and we use the remaining 30,000 images to train our models.

CIFAR-100^13. Like CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009) has 50,000 training and 10,000 testing images, but they are evenly divided into 100 classes. We randomly sample 20,000 training images as our validation set.

^12 https://www.cs.toronto.edu/~kriz/cifar.html
^13 https://www.cs.toronto.edu/~kriz/cifar.html
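The validation sets above are drawn uniformly at random from each training split. A minimal sketch of such a split (our own illustration; the helper name is an assumption):

```python
import numpy as np

def split_indices(n_train, n_val, seed=0):
    """Randomly carve a validation subset out of a training split, e.g. 20,000
    of CIFAR-10's 50,000 training images."""
    perm = np.random.default_rng(seed).permutation(n_train)
    return perm[n_val:], perm[:n_val]   # (training indices, validation indices)

train_idx, val_idx = split_indices(50_000, 20_000)
```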
Tiny ImageNet^14. Tiny ImageNet (CS231n) is a subset of the ImageNet-1K dataset (Russakovsky et al., 2015). It covers 200 class labels and all images are of resolution 64×64. It includes 100,000 training, 10,000 validation, and 10,000 testing images. The given test split does not have ground-truth labels, so we discard it and use the validation split as our testing data. We randomly sample 40,000 training images as the validation data and use the remaining 60,000 to train the models.

ImageNet-1K^15. We use the image classification dataset from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 (Russakovsky et al., 2015). This dataset contains 1,281,167 training, 50,000 validation, and 100,000 testing images, covering 1,000 classes; images are of various resolutions. Since the models we use are pre-trained on this dataset, we do not train the last linear layer of the models. The given test split comes without ground-truth labels; thus we use the validation split to evaluate our method and the baselines. Among the 50,000 validation images, we randomly select 10,000 as our testing data and treat the remaining ones as validation data.

C.2 MODELS

We use ResNet (He et al., 2016) and Swin Transformer V2 (SwinV2) (Liu et al., 2022) models for the image classification task because they are popular models for the task and many of their weights pre-trained on the ImageNet-1K dataset (Russakovsky et al., 2015) are available online^16, achieving reasonable performance. On CIFAR-10, CIFAR-100, and Tiny ImageNet, we freeze all parameters of the pre-trained models and only train the last linear layer of each model from scratch. The output dimension of the last layer is set to the number of image classes in the test dataset.

We implement the soft classifier (see Section 4.1) by sampling w.r.t. the probability distribution output by the softmax layer, i.e., $\Pr[f_i(x) = j] = \frac{\exp(z_j/\tau)}{\sum_k \exp(z_k/\tau)}$, where $z_k$ is the logit for $k \in [C]$ and $\tau$ is the temperature hyper-parameter controlling the randomness of predictions. We choose a small $\tau$ (1e-3) to reduce the variance in predictions. At test time, to obtain consistent results, all classifiers make predictions by outputting the most likely class label (i.e., $\arg\max_j f_i(x)[j]$), which is equivalent to a soft classifier with $\tau \to 0$.

For all seven models, we use the Adam optimizer (Kingma & Ba, 2015) with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, a constant learning rate of 0.00001, and a batch size of 500 for training. Models are trained until convergence.

^14 http://cs231n.stanford.edu/tiny-imagenet-200.zip
^15 https://image-net.org/download.php
^16 For example, the pre-trained models we use are from https://pytorch.org/vision/stable/models.html. Specifically, the pre-trained weights we use are as follows. ResNet-18: ResNet18_Weights.IMAGENET1K_V1; ResNet-34: ResNet34_Weights.IMAGENET1K_V1; ResNet-50: ResNet50_Weights.IMAGENET1K_V1; ResNet-101: ResNet101_Weights.IMAGENET1K_V1; SwinV2-T: Swin_V2_T_Weights.IMAGENET1K_V1; SwinV2-S: Swin_V2_S_Weights.IMAGENET1K_V1; SwinV2-B: Swin_V2_B_Weights.IMAGENET1K_V1.
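For concreteness, the soft classifier described above can be implemented in a few lines of PyTorch. This is our own sketch rather than the released code; the function name and the toy logits are illustrative only.

```python
import torch

def soft_classify(logits, tau=1e-3):
    """Sample a label from the temperature-scaled softmax
    Pr[f_i(x) = j] = exp(z_j / tau) / sum_k exp(z_k / tau).
    A small tau (e.g., 1e-3) makes predictions nearly deterministic; as
    tau -> 0 this recovers argmax_j f_i(x)[j]."""
    probs = torch.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

# Toy example: a batch of 2 queries over C = 5 classes.
logits = torch.randn(2, 5)
print(soft_classify(logits), logits.argmax(dim=-1))  # almost always identical
```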