# Geometry-Aware Adaptation for Pretrained Models

Nicholas Roberts, Xintong Li, Dyah Adila, Sonia Cromp, Tzu-Heng Huang, Jitian Zhao, Frederic Sala
University of Wisconsin-Madison
{nick11roberts, fredsala}@cs.wisc.edu, {xli2224, adila, cromp, thuang273, jzhao326}@wisc.edu

**Abstract.** Machine learning models, including prominent zero-shot models, are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them. We propose a simple approach to exploit this information to adapt the trained model to reliably predict new classes or, in the case of zero-shot prediction, to improve its performance without any additional training. Our technique is a drop-in replacement of the standard prediction rule, swapping arg max with the Fréchet mean. We provide a comprehensive theoretical analysis for this approach, studying (i) learning-theoretic results trading off label space diameter, sample complexity, and model dimension, (ii) characterizations of the full range of scenarios in which it is possible to predict any unobserved class, and (iii) an optimal active learning-like next-class selection procedure to obtain optimal training classes for when it is not possible to predict the entire range of unobserved classes. Empirically, using easily-available external metrics, our proposed approach, LOKI, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no such metric is available, LOKI can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.

## 1 Introduction

The use of pretrained models is perhaps the most significant recent shift in machine learning. Such models can be used out-of-the-box, either to predict classes observed during pretraining (as in ImageNet-trained ResNets [14]) or to perform zero-shot classification on any set of classes [32]. While this is exciting, label spaces are often so huge that models are unlikely to have seen even a single point with a particular label. Without additional modification, pretrained models will simply fail to reliably predict such classes. Instead, users turn to fine-tuning, which requires additional labeled data and training cycles and so sacrifices much of the promise of zero-shot usage.

How can we adapt pretrained models to predict new classes, without fine-tuning or retraining? At first glance, this is challenging: predictive signal comes from labeled training data. However, relational information between the classes can be exploited to enable predicting a class even when there are no examples with this label in the training set. Such relational data is commonly available, or can be constructed with the help of knowledge bases, ontologies, or powerful foundation models [39]. How to best exploit relational structure remains unclear, with a number of key challenges. We might wish to know what particular subset of classes is rich enough to enable predicting many (or all) remaining labels. This is crucial in determining whether a training set is usable or, even with the aid of structure, insufficient. It is also unclear how approaches that use relational information interact with the statistical properties of learning, such as training sample complexity. Finally, performing adaptation requires an efficient and scalable algorithm.
Figure 1: Classification regions in the probability simplex of 3-class classifiers faced with a 100-class problem. The probability simplex using arg max prediction can only output one of three classes; LOKI uses the entire probability vector to navigate the class metric space, leading to more prediction regions. (Left) Regions from arg max prediction. (Center, Right) Classification regions from LOKI.

This work addresses these challenges. It proposes a simple and practical approach to learning in structured label spaces, with theoretical guarantees. First, we offer a simple way to translate the soft outputs (i.e., probability vectors) produced by any supervised learning model into a more general model that can exploit geometric information from the label structure. In other words, our approach, called LOKI (the name refers to the locus, plural loci, of the Fréchet mean), is a simple adaptor for pretrained models. LOKI can be applied via a fixed linear transformation of the model outputs. LOKI's simplicity makes it applicable to a broad range of settings while enabling very high-cardinality predictions subject to a potentially small model output budget; we provide a visualization of this key idea in Figure 1.

Theoretically, we provide a rich set of results for the metric-based adaptation setting. First, we introduce a learning-theoretic result in terms of the sample complexity of the pretrained model. It captures a key tradeoff between the number of classes and metric structure, the problem dimensionality, and the number of samples used to train the model prior to adaptation. Next, we exhaustively study the properties of training sets, determining for a wide class of relational structures the minimum number (and structure) of subsets that enable reliable prediction. Finally, we show how to exploit this result in an active learning-like approach to selecting points that will improve deficient training datasets.

Experimentally, we show that using structural information improves prediction in high-cardinality settings. We demonstrate the strength of the active learning-based approach to dataset expansion over random baselines. Finally, and most excitingly, we show that even in zero-shot models like CLIP, where it is possible to produce probability vectors over any possible class, the use of our adaptor leads to a 19.53% relative improvement.

## 2 Background

First, we introduce the problem setting, notation, and mathematical tools that we will use. Afterward, we discuss how LOKI relates to prior work.

**Problem Setting.** As in conventional supervised learning, we have a dataset $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from a distribution on $X \times Y$, where $X$ and $Y$ are the input and label spaces. In our setting, $N := |Y|$ is finite but large; often $N \gg n$, so that many labels will simply not be found in the training dataset. We let $\Lambda \subseteq Y$ with $|\Lambda| = K$ be the set of observed labels. For convenience of notation, we also define $\lambda := [\lambda_i]_{i=1}^K$ and $y := [y_j]_{j=1}^N$ to be the vectors of elements in $\Lambda$ and $Y$. In addition to our dataset, we have access to a relational structure on $Y$. We assume that $Y$ is a metric space with metric (distance) $d : Y^2 \to \mathbb{R}$; $d$ encodes the relational structure of the label space. Specifically, we model this metric space using a graph $G = (Y, E)$, where $E \subseteq Y^2 \times \mathbb{R}_+$ is a set of weighted edges relating the labels, together with the standard shortest-path distance $d : Y^2 \to \mathbb{R}_{\geq 0}$.
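To make the relational structure concrete, the following minimal sketch builds a toy label graph and its shortest-path distance matrix; the tiny hierarchy, the use of `networkx`, and all variable names are our own illustrative choices rather than anything prescribed by the paper.

```python
import networkx as nx
import numpy as np

# Toy label space: a small hierarchy over a handful of classes (illustrative only).
G = nx.Graph()
G.add_edges_from([
    ("animal", "dog"), ("animal", "cat"), ("animal", "bird"),
    ("vehicle", "car"), ("vehicle", "truck"),
    ("root", "animal"), ("root", "vehicle"),
])
labels = sorted(G.nodes)

# Shortest-path distances d(y, y') define the label metric.
dist = dict(nx.all_pairs_shortest_path_length(G))
D = np.array([[dist[a][b] for b in labels] for a in labels], dtype=float)
print(labels)
print(D)  # symmetric N x N matrix of graph distances
```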
In addition to its use in prediction, the metric $d$ can be used for evaluating a model by measuring, for example, $\frac{1}{n}\sum_{i=1}^{n} d^2(f(x_i), y_i)$, the analogue of the empirical squared loss.

**Fréchet Mean Estimator.** Drawing on ideas from structured prediction [5, 35], we use a simple predictor that exploits the label metric. It relies on computing the Fréchet mean [8], given by
$$m_y(w) := \arg\min_{y' \in Y} \sum_{i=1}^{K} w_i\, d^2(y', y_i), \qquad (1)$$
where $w \in \mathbb{R}^{K}_{\geq 0}$ is a set of weights. The Fréchet mean generalizes the barycenter to metric spaces and is often used in geometric machine learning [26].

**Locus of the Fréchet mean.** The locus of the Fréchet mean is the set of all Fréchet means under different weights [29]. We write it as $\Pi(y) := \bigcup_{w \in \Delta^{K-1}} m_y(w)$. $\Pi(y)$ can be thought of as the set of all labels in $Y$ that are reachable by the Fréchet mean given $\{y_i\}_{i=1}^{K} \subseteq Y$ and different choices of its parameter $w$. Intuitively, we can think of the locus for a given dataset as describing how usable it is for predicting beyond the observed labels. Trivially, if $\{y_i\}_{i=1}^{K} = Y$, then $\Pi(y) = Y$. We are primarily interested in the case in which $\{y_i\}_{i=1}^{K} \subsetneq Y$ yet we still have $\Pi(y) = Y$, or at least that $|\Pi(y)|$ is sufficiently large.

### 2.1 Relation to Prior Work

LOKI is primarily related to two areas: zero-shot learning and structured prediction.

**Zero-Shot Learning.** Like zero-shot learning (ZSL), LOKI is capable of predicting unobserved classes. Our framework is most closely related to generalized ZSL, which uses side information to predict both observed and unobserved classes. Many types of external knowledge are used in ZSL, including text [28, 38, 9], attributes [22, 23, 7], knowledge graphs [42, 34, 10], ontologies [25], and logical rules [33]. Our work is most closely related to ZSL approaches that rely on knowledge graph information. Often, ZSL methods that use knowledge graph information rely on graph neural network architectures [42, 17, 18]. However, we note that these architectures can be heavyweight and challenging to scale up to extremely large graphs, whereas LOKI has no architectural requirements and scales linearly in the size of the graph when $K$ is small.

**Structured Prediction.** Structured prediction (SP) operates in label spaces that are endowed with algebraic or geometric structure [1, 21]. This includes problems such as predicting permutations [19], non-Euclidean and manifold regression [31, 36], and learning on graphs [12]. LOKI is most immediately related to the latter; however, any finite metric space can be represented as a graph, which further lends flexibility to our approach. Even in discrete structured prediction settings, the cardinality of the label space may be combinatorially large. As such, LOKI can be viewed as a simple method for adapting classifiers to structured prediction. The Fréchet mean has been used in structured prediction, but in approaches requiring training. In [5], $\hat{f}(x) = \arg\min_{y \in Y} \frac{1}{n}\sum_{i=1}^{n} \alpha_i(x)\, d^2(y, y_i)$, where $\alpha(x) = (\mathbf{K} + n\nu I)^{-1}\mathbf{K}_x$. Here $\mathbf{K}$ is the kernel matrix for a kernel $k : X \times X \to \mathbb{R}$, so that $\mathbf{K}_{i,j} = k(x_i, x_j)$ and $(\mathbf{K}_x)_i = k(x, x_i)$, and $\nu$ is a regularization parameter. In other words, the weight $w_i$ corresponding to $y_i$ is the average produced by solving a kernel regression problem at all points $x_k$ in the dataset where $y_k = y_i$. The Fréchet mean has also been used in weak supervision (WS) for metric-equipped label spaces [41, 37], where the goal is to produce labeled data for training structured prediction models.
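Before introducing the framework itself, the estimator in Eq. (1) and the locus can be illustrated by brute force on the toy graph from the previous sketch; the weight grid and the helper `frechet_mean` are hypothetical names of ours.

```python
import networkx as nx
import numpy as np

G = nx.Graph([("animal", "dog"), ("animal", "cat"), ("animal", "bird"),
              ("vehicle", "car"), ("vehicle", "truck"),
              ("root", "animal"), ("root", "vehicle")])
labels = sorted(G.nodes)
dist = dict(nx.all_pairs_shortest_path_length(G))

def frechet_mean(observed, w):
    """Return the set of minimizers of sum_i w_i * d^2(y, observed_i) over all labels."""
    costs = {y: sum(wi * dist[y][yi] ** 2 for wi, yi in zip(w, observed)) for y in labels}
    best = min(costs.values())
    return {y for y, c in costs.items() if np.isclose(c, best)}

# Approximate the locus Pi(observed) by sweeping weights over a simplex grid.
observed = ["dog", "truck"]
locus = set()
for a in np.linspace(0, 1, 21):
    locus |= frechet_mean(observed, [a, 1 - a])
print(locus)  # labels reachable from just two observed classes
```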
## 3 Framework

We now introduce our framework, LOKI. We show how LOKI can be used to adapt any supervised classifier over a set of $K$ classes to a much richer set of possible class predictions. It does so by weighting the Fréchet mean by the classifier's per-class prediction probabilities or logits, allowing it to predict any class in the locus of the Fréchet mean, potentially far more classes than the initial $K$. Next, we show how LOKI can be expressed as a fixed linear transformation of a model's outputs. Finally, we show how LOKI relates to standard classification.

### 3.1 LOKI: Adapting Pretrained Models

We describe our approach to adapting pretrained classifiers trained on a set of classes $\Lambda$ to the metric geometry of the label space $Y$, enabling the prediction of unobserved classes. We model unobserved classes $Y \setminus \Lambda$ using the Fréchet mean among observed classes weighted by their prediction probabilities $P(y = \lambda_i \mid x, y \in \Lambda)$. We denote the vector of model outputs as $P_{y|x} := [P_{\lambda_i|x}]_{i=1}^{K} = [P(y = \lambda_i \mid x, y \in \Lambda)]_{i=1}^{K}$. Then predictions using LOKI are given by
$$\hat{y} \in m_\lambda(P_{y|x}) = \arg\min_{y \in Y} \sum_{i=1}^{K} P_{\lambda_i|x}\, d^2(y, \lambda_i).$$

### 3.2 LOKI as a Linear Transformation of Model Outputs

Most standard classifiers output a vector of prediction probabilities, $P_{y|x}$, whose entries correspond to the confidence of predicting a specific class. Predictions are typically given by $\hat{y} \in \arg\max_{i \in [K]} (P_{y|x})_i$. LOKI generalizes this prediction rule when viewed as a linear transformation of $P_{y|x}$. Consider the LOKI prediction rule
$$\hat{y} \in m_\lambda(P_{y|x}) = \arg\min_{y \in Y} \sum_{i=1}^{K} P_{\lambda_i|x}\, d^2(y, \lambda_i) = \arg\max_{j \in [N]} (D P_{y|x})_j,$$
where $D_{j,i} = -d^2(y_j, \lambda_i)$; $D \in \mathbb{R}^{N \times K}$ is the matrix of negative squared distances between the observed classes and the rest of the label space. Thus LOKI can be used within standard classification pipelines when the model output $P_{y|x}$ is multiplied by the fixed matrix $D$.

### 3.3 Generalizing Standard Classification

We provide a simple intuition for our approach. The fact that LOKI reduces to standard classification among the observed classes has several implications. This includes the idea that under our framework, forms of few-shot, zero-shot, hierarchical, and partial label learning all reduce to standard classification when additional metric information is introduced.

**Generalizing the arg max prediction rule.** In the absence of metric information, a situation that we model using the complete graph and setting $\Lambda = Y$, our framework also recovers standard classification. Indeed, both in terms of error modeling and in terms of inter-class similarity, the intuition of standard classification and of the 0-1 loss is captured well by the unweighted complete graph: simply treat all differences equally. This graph is given as $K_N := (Y, Y^2 \times \{1\})$, i.e., every label is equidistant from every other label. Plugging this into LOKI, we obtain
$$\hat{y} \in m_\lambda(P_{y|x}) = \arg\min_{y \in Y} \sum_{i=1}^{K} P_{\lambda_i|x}\, d^2(y, \lambda_i) = \arg\min_{y \in Y} \sum_{i=1}^{K} P_{\lambda_i|x}\, \mathbb{1}\{y \neq \lambda_i\} = \arg\max_{i \in [K]} P_{\lambda_i|x},$$
which is exactly the standard classification prediction rule.

**Generalizing the 0-1 loss via the expected squared distance.** The expected squared distance $\mathbb{E}[d^2(y, \hat{y})]$ is the standard loss function for many structured prediction problems. Note that accuracy fails in such settings, since it cannot distinguish between small and large errors. This is most clearly seen in the extreme case of regression, where test accuracy will virtually always be zero no matter how good a trained model is. At the other extreme, the complete graph, this loss function becomes the standard 0-1 loss: $\mathbb{E}[d^2(y, \hat{y})] = \mathbb{E}[\mathbb{1}\{y \neq \hat{y}\}]$.
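The linear-transformation view of Section 3.2 amounts to one matrix product; below is a minimal NumPy sketch, where `dist_full` is assumed to hold the $N \times K$ matrix of label-space distances (the function name and toy numbers are ours).

```python
import numpy as np

def loki_predict(probs, dist_full):
    """Adapt K-class probabilities to an N-class prediction.

    probs:     (K,) per-class probabilities (or logits) over observed classes.
    dist_full: (N, K) metric distances d(y_j, lambda_i) between every label
               and every observed class.
    Returns the index j in [N] minimizing sum_i probs[i] * d^2(y_j, lambda_i),
    equivalently argmax_j (D @ probs) with D = -dist_full**2.
    """
    D = -(dist_full ** 2)
    return int(np.argmax(D @ probs))

# Tiny example: 3 observed classes, 5 total classes, arbitrary distances.
dist_full = np.array([[0, 2, 3],
                      [1, 1, 2],
                      [2, 0, 3],
                      [2, 2, 1],
                      [3, 3, 0]], dtype=float)
probs = np.array([0.2, 0.5, 0.3])
print(loki_predict(probs, dist_full))  # may return a class the model never saw
```

With the complete graph (all off-diagonal distances equal to 1), `D @ probs` is largest at the most probable observed class, recovering the standard arg max rule described in Section 3.3.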
As an adaptor for structured label spaces, we use the empirical version of this loss to evaluate LOKI. Note that the expected squared distance subsumes other metrics as well. For example, when $Y = \mathbb{R}$, we can derive the standard MSE by setting $d(y, \hat{y}) = |y - \hat{y}|$, which is just the standard $L_1$ distance metric. Other scores, such as recall at top-$k$, can be similarly obtained at the cost of $d(\cdot, \cdot)$ no longer being a true metric. In other words, $\mathbb{E}[d^2(y, \hat{y})]$ is a very general measure that supports any metric space, and we use it throughout this work.

## 4 Theoretical Results

**Challenges and Opportunities.** The arg max of per-class model probabilities is a ubiquitous component of classification pipelines in machine learning. In order to predict unobserved classes using metric space information, LOKI replaces this standard component. As a simple but significant change to standard pipelines, LOKI opens up a new area for fundamental questions. There are three main flavors of theoretical questions that arise in this setting:
1. How does the performance of LOKI change as a function of the number of samples?
2. What minimal sets of observed classes are required to predict any class in the metric space?
3. How can we acquire new classes that maximize the total number of classes we can predict?

Excitingly, we provide a comprehensive answer to these questions for the most common metric spaces used in practice. First, we provide a general error bound, in terms of the number of samples, observed classes, problem dimension, and the diameter of the metric space, that holds for any finite metric space. Second, we characterize the sets of observed classes that are required to enable prediction of any class, and we show how this set differs for various types of metric spaces of interest. Finally, we provide an active learning algorithm for selecting additional classes to observe so as to maximize the number of classes that can be predicted, and we characterize the types of metric spaces for which the locus can be computed efficiently. These results provide a strong theoretical grounding for LOKI.

### 4.1 Sample Complexity

What is the interplay between predicting unobserved classes based on metric space information and standard learning-theoretic notions like the sample complexity needed to train a model? Our first result illustrates this tradeoff and relates it to the squared distance loss. Suppose that we only observe $K \leq N$ of the classes at training time, and that we fit a $K$-class Gaussian mixture model. We use LOKI to adapt our pretrained classifier to enable classification of any of the $N$ classes. We have that (a formal statement and interpretation are in the Appendix):

**Theorem 4.1 (Informal LOKI sample complexity).** Let $Y$ be a set of classes represented by $d$-dimensional vectors under the Euclidean distance, and let $\Lambda \subseteq Y$ be the set of $K$ observed classes. Assume that $n$ training examples are generated by an identity-covariance Gaussian mixture model over classes $\Lambda$, and that test examples are generated over all classes $Y$. Assume that we estimate a Gaussian mixture model on the training set and obtain probability estimates $\hat{P}(y_i|x)$ for $i \in [K]$ for a sample $(x, y^*)$ from the test distribution. Then with high probability, under the following model,
$$\hat{y}^* \in m_\Lambda\big([\hat{P}(y_i|x)]_{i \in [K]}\big) = \arg\min_{y \in Y} \sum_{i=1}^{K} \hat{P}(y_i|x)\, d^2(y, y_i),$$
the expected squared distance $\mathbb{E}_{(x,y^*) \sim D_{\text{test}}}[d^2(y^*, \hat{y}^*)]$ between the target $y^*$ and the prediction $\hat{y}^*$ is bounded by a quantity (stated formally in the Appendix) that improves with the number of samples $n$, scales up with the dimension $d$ and the number of observed classes $K$, and depends on a constant $\alpha$ related to the sensitivity of the Fréchet variance to different node choices.
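To make the setting of Theorem 4.1 tangible, here is a toy simulation under assumptions of our own choosing: a Euclidean grid of class means, plain sample-mean estimation (the formal proof uses a median-of-means estimator), and an identity-covariance Gaussian mixture fit on K observed classes, with LOKI compared against arg max on test points drawn from all N classes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, K, n = 2, 25, 9, 900

# Label space: a 5x5 grid of class means in R^2; observe only K of them.
Y = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
obs = rng.choice(N, size=K, replace=False)

# Training data from observed classes; estimate means by sample averages.
train_y = np.repeat(obs, n // K)
train_x = Y[train_y] + rng.normal(size=(n, d))
mu_hat = np.array([train_x[train_y == c].mean(axis=0) for c in obs])

def posteriors(x):
    # Identity-covariance GMM posteriors over the K observed classes.
    logits = -0.5 * ((x - mu_hat) ** 2).sum(axis=1)
    p = np.exp(logits - logits.max())
    return p / p.sum()

Yobs = Y[obs]
D = -((Y[:, None, :] - Yobs[None, :, :]) ** 2).sum(axis=-1)  # (N, K) = -d^2

# Test points come from all N classes.
err_loki, err_argmax = [], []
for _ in range(2000):
    y_star = rng.integers(N)
    x = Y[y_star] + rng.normal(size=d)
    p = posteriors(x)
    y_loki = int(np.argmax(D @ p))
    y_am = int(obs[np.argmax(p)])
    err_loki.append(((Y[y_star] - Y[y_loki]) ** 2).sum())
    err_argmax.append(((Y[y_star] - Y[y_am]) ** 2).sum())
print("E[d^2] LOKI:", np.mean(err_loki), " arg max:", np.mean(err_argmax))
```

On runs like this, the LOKI column is typically the smaller of the two, since arg max can never output any of the N − K unobserved classes.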
### 4.2 Optimal Label Subspaces

Next, we characterize the subset of distinct labels required to predict any label using our label model with respect to various types of metric spaces. We first consider label spaces whose metric is a tree graph; such metrics are, for example, related to performing hierarchical classification and weak supervision (where only partial labels are available). We consider a special type of tree called a phylogenetic tree, in which only the leaves can be designated as labels; phylogenetic trees are commonly used to relate the labels of image classification datasets. Afterwards, we perform a similar analysis for grid graphs, which are important for label spaces that encode spatial information. Finally, we discuss the case in which no useful metric information is available, i.e., the complete graph.

Our goal in this section is to characterize the properties and size of $\{\lambda_i\}_{i=1}^{K}$ in each of these metric spaces such that we still have $\Pi(\Lambda) = Y$. We characterize optimal subsets of classes in each of the spaces under certain notions of optimality. We provide several relevant definitions pertaining to this concept, starting with a notion of being able to predict any possible class using observed classes.

**Definition 4.2 (Locus cover).** Given a set $\Lambda \subseteq Y$ for which we construct a tuple of its elements, if it holds that $\Pi(\Lambda) = Y$, then $\Lambda$ is a locus cover.

Definition 4.2 captures the main idea of LOKI: using some set of observed classes for which we can train classifiers, we would like to be able to predict additional unobserved classes using the geometry that relates the observed and unobserved classes. Namely, elements of $\Pi(\Lambda)$ are reachable using LOKI. We refine this definition to describe the trivial case that defaults to standard classification and the nontrivial case for which LOKI moves beyond standard classification.

**Definition 4.3 (Trivial locus cover).** If $\Lambda = Y$, then $\Lambda$ is the trivial locus cover.

This definition captures the notion of observing all of the classes in the label space. Here, all of the elements of $Y$ are trivially reachable using LOKI.

**Definition 4.4 (Nontrivial locus cover).** A locus cover $\Lambda$ is nontrivial if $\Lambda \neq Y$.

LOKI is more useful and interesting when faced with a nontrivial locus cover: under Definition 4.4, we can use some subset of classes $\Lambda$ to predict any label in $Y$.

**Definition 4.5 (Minimum locus cover).** Given a set $\Lambda \subseteq Y$, if $\Lambda$ is the smallest set that is still a locus cover, then it is a minimum locus cover.

In cases involving an extremely large number of classes, it is desirable to use LOKI on the smallest possible set of observed classes $\Lambda$ such that all labels in $Y$ can still be predicted. Definition 4.5 characterizes these situations; later, we obtain the minimum locus covers for all trees and grid graphs. It is worth noting that the minimum locus cover need not be unique for a fixed graph.

**Definition 4.6 (Identifying locus cover).** Given a set $\Lambda \subseteq Y$, if $\Lambda$ is a locus cover where for every $y \in Y$ there exists $w \in \Delta^{|\Lambda|-1}$ such that $m_\Lambda(w) = \{y\}$, then $\Lambda$ is an identifying locus cover.

The Fréchet mean need not be unique: as an arg min, it returns a set of minimizers. In certain metric spaces, the minimum locus cover can yield large sets of minimizers; this is undesirable, as it makes predicting a single class challenging. Definition 4.6 appeals to the idea of finding some set of classes for which the Fréchet mean always returns a unique minimizer. This is desirable in practice, and in some cases, more so than Definition 4.5.

**Definition 4.7 (Pairwise decomposable).**
Given $\Lambda \subseteq Y$, $\Pi(\Lambda)$ is called pairwise decomposable when it holds that $\Pi(\Lambda) = \bigcup_{\lambda_1, \lambda_2 \in \Lambda} \Pi(\{\lambda_1, \lambda_2\})$.

In many cases, the locus can be written in a more convenient form: the union of the loci of pairs of nodes. We refer to this property as pairwise decomposability. Later, we shall see that pairwise decomposability is useful in computing the locus in polynomial time.

**Trees.** Many label spaces are endowed with a tree metric in practice: hierarchical classification, in which the label space includes both classes and superclasses; partial labeling problems, in which internal nodes can represent the prediction of a set of classes; and the approximation of complex or intractable metrics using a minimum spanning tree. We show that for our purposes, trees have certain desirable properties that make them easy to use with LOKI, namely that we can easily identify a locus cover that satisfies both Definition 4.5 and Definition 4.6. Conveniently, we also show that any locus in any tree satisfies Definition 4.7. We first note that the leaves of any tree yield the minimum locus cover. This is a convenient property: any label from any label space endowed with a tree metric can be predicted using LOKI with only classifiers trained on labels corresponding to the leaves of the metric space. This can be especially useful if the tree has long branches and few leaves. Additionally, for tree metric spaces, the minimum locus cover (Definition 4.5) is also an identifying locus cover (Definition 4.6). This follows from the construction of the weights in the proof of Theorem A.4 (shown in the Appendix) and the property that all paths in trees are unique. Finally, we note that any locus in any tree is pairwise decomposable; the proof of this is given in the Appendix (Lemma A.5). We will see later that this property yields an efficient algorithm for computing the locus.

**Phylogenetic Trees.** Image classification datasets often have a hierarchical tree structure, where only the leaves are actual classes and internal nodes are designated as superclasses; examples include the ImageNet [6] and CIFAR-100 [20] datasets. Tree graphs in which only the leaf nodes are labeled are referred to as phylogenetic trees [3]. Often, these graphs are weighted, but unless otherwise mentioned, we assume that the graph is unweighted. For any arbitrary tree $T = (V, E)$, the set of labels induced by the phylogenetic tree graph is $Y = \text{Leaves}(T)$. We provide a heuristic algorithm for obtaining locus covers for arbitrary phylogenetic trees in Algorithm B.1 (see Appendix). We prioritize adding endpoints of long paths to $\Lambda$, and continue adding nodes in this way until $\Pi(\Lambda) = Y$. Similarly to tree metric spaces, any phylogenetic tree metric space is pairwise decomposable. We prove the correctness of Algorithm B.1 and the pairwise decomposability of phylogenetic trees in the Appendix (Theorem A.6 and Lemma A.7). Later, we give algorithms for computing the set of nodes in an arbitrary locus in arbitrary graphs; if the locus is pairwise decomposable, the algorithm for doing so is efficient, and if not, it has time complexity exponential in $K$. Due to the pairwise decomposability of phylogenetic trees, this polynomial-time algorithm to compute $\Pi(\Lambda)$ applies.
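As a quick numerical illustration of the tree results above, the sketch below checks on a small balanced tree that the leaves alone reach every node, exploiting pairwise decomposability by taking a union of brute-force pair loci; the helper functions and the weight grid are our own stand-ins for the appendix algorithms.

```python
import itertools
import networkx as nx
import numpy as np

# A small tree metric space: balanced binary tree with 15 nodes.
T = nx.balanced_tree(2, 3)
dist = dict(nx.all_pairs_shortest_path_length(T))
leaves = [v for v in T.nodes if T.degree[v] == 1]

def pair_locus(a, b, grid=np.linspace(0, 1, 101)):
    """Brute-force Pi({a, b}): minimizers of w*d^2(y,a) + (1-w)*d^2(y,b)."""
    locus = set()
    for w in grid:
        costs = {y: w * dist[y][a] ** 2 + (1 - w) * dist[y][b] ** 2 for y in T.nodes}
        best = min(costs.values())
        locus |= {y for y, c in costs.items() if np.isclose(c, best)}
    return locus

# Pairwise decomposability: the locus of the leaf set is the union over leaf pairs.
locus = set()
for a, b in itertools.combinations(leaves, 2):
    locus |= pair_locus(a, b)
print(len(locus) == T.number_of_nodes())  # True: the leaves form a locus cover
```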
**Grid Graphs.** Classes often have a spatial relationship. For example, classification on maps or the discretization of a manifold both have spatial relationships; grid graphs are well suited to these. We obtain minimum locus covers for grid graphs satisfying Definition 4.5, but we find that these are not generally identifying locus covers. On the other hand, we give an example of a simple identifying locus cover satisfying Definition 4.6. Again, we find that grid graphs are in general pairwise decomposable and hence follow Definition 4.7. We find that the pair of vertices on furthest opposite corners yields the minimum locus cover. While the set of vertices given by Theorem A.8 (found in the Appendix) satisfies Definition 4.5, this set does not in general satisfy Definition 4.6. This is because the path between two vertices is not unique, so corresponding vertices on shortest paths of the same length between the pair can be equivalent minimizers. In contrast, the following example set of vertices satisfies Definition 4.6, but it clearly does not satisfy Definition 4.5.

**Example:** Given a grid graph, the set of all corners is an identifying locus cover.

On the other hand, the vertices given by Theorem A.8 can be useful for other purposes. Lemma A.9 (provided in the Appendix) shows that subspaces of grid graphs can be formed by the loci of pairs of vertices in $\Lambda$. This in turn helps to show that loci in grid graphs are pairwise decomposable in general (see Lemma A.10 in the Appendix).

**The Complete Graph.** The standard classification setting does not use relational information between classes. As before, we model this setting using the complete graph, and we show the expected result that in the absence of useful relational information, LOKI cannot help, and the problem once again becomes standard multiclass classification among observed classes. To do so, we show that there is no nontrivial locus cover for the complete graph (Theorem A.11 in the Appendix).

### 4.3 Label Subspaces in Practice

While it is desirable for the set of observed classes to form a minimum or identifying locus cover, it is often not possible to choose the initial set of observed classes a priori; these are often random. In this section, we describe the more realistic case in which a random set of classes is observed, and we give an active learning-based strategy to choose the next observed class. The aim of our active learning approach is, instead of randomly selecting the next observed class, to actively select the next class so as to maximize the total size of the locus, i.e., the number of possible classes that can be output using LOKI. Before maximizing the locus via active learning, we must first address a much more basic question: can we even efficiently compute the locus?

**Computing the Locus.** We provide algorithms for obtaining the set of all classes in the locus, given a set of classes $\Lambda$. We show that when the locus is pairwise decomposable (Definition 4.7), we can compute the locus efficiently using a polynomial-time algorithm. When the locus is not pairwise decomposable, we provide a general algorithm that has time complexity exponential in $|\Lambda|$; we are not aware of a more efficient algorithm. We note that any locus for every type of graph that we consider in Section 4.2 is pairwise decomposable, so our polynomial-time algorithm applies. Algorithms B.2 and B.3, along with their time complexity analyses, can be found in the Appendix.
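In the pairwise-decomposable case, the locus is a union over $O(|\Lambda|^2)$ pair loci, each of which only needs the distance matrix. The sketch below is a brute-force stand-in of ours (it sweeps a weight grid per pair rather than implementing Algorithm B.2), run on a small grid graph.

```python
import itertools
import networkx as nx
import numpy as np

G = nx.grid_2d_graph(4, 4)                      # 4x4 grid metric space
nodes = list(G.nodes)
dist = dict(nx.all_pairs_shortest_path_length(G))

def frechet_set(Lambda, w):
    costs = {y: sum(wi * dist[y][l] ** 2 for wi, l in zip(w, Lambda)) for y in nodes}
    best = min(costs.values())
    return {y for y, c in costs.items() if np.isclose(c, best)}

def locus_pairwise(Lambda, steps=200):
    """Union of pair loci; equals Pi(Lambda) when the locus is pairwise decomposable."""
    locus = set()
    for a, b in itertools.combinations(Lambda, 2):
        for w in np.linspace(0, 1, steps):
            locus |= frechet_set([a, b], [w, 1 - w])
    return locus

Lambda = [(0, 0), (1, 3), (3, 1)]
print(len(locus_pairwise(Lambda)), "of", len(nodes), "classes reachable")
```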
**Large Locus via Active Next-Class Selection.** We now turn to actively selecting the next class to observe in order to maximize the size of the locus. For this analysis, we focus on the active learning setting in which the class structure is a tree graph, as tree graphs are generic enough to apply to a wide variety of cases, including approximating other graphs using the minimum spanning tree. Assume the initial set of $K$ observed classes is sampled at random from some distribution. We would like to actively select the $(K+1)$st class such that $|\Pi(\Lambda)|$ with $\Lambda = \{\lambda_i\}_{i=1}^{K+1}$ is as large as possible.

**Theorem 4.8.** Let $T = (Y, E)$ be a tree graph and let $\Lambda \subseteq Y$ with $K = |\Lambda|$. Let $T'$ be the subgraph of the locus $\Pi(\Lambda)$. The vertex $v \in Y \setminus \Lambda$ that maximizes $|\Pi(\Lambda \cup \{v\})|$ is the solution to the following optimization problem:
$$\arg\max_{y \in Y \setminus \Pi(\Lambda)} d(y, b) \quad \text{s.t.} \quad b \in \partial_{\mathrm{in}} T' \ \text{ and } \ \Gamma(y, b) \setminus \{b\} \subseteq Y \setminus \Pi(\Lambda),$$
where $\partial_{\mathrm{in}} T'$ is the inner boundary of $T'$ (all vertices in $T'$ that share an edge with vertices not in $T'$).

(Under the complete graph, the selection strategies are equivalent, as LOKI reduces to arg max prediction.) This procedure can be computed in polynomial time: solving the optimization problem in Theorem 4.8 simply requires searching over pairs of vertices. Hence we have provided an efficient active learning-based strategy to maximize the size of the locus for trees.

## 5 Experimental Results

In this section, we provide experimental results to validate the following claims:
1. LOKI improves the performance of zero-shot foundation models, even with no external metric.
2. LOKI adapts to label spaces with a large number of unobserved classes.
3. The active approach given in Theorem 4.8 yields a larger locus than the passive baseline.
4. The same active approach yields better performance on ImageNet.
5. With LOKI, calibration can improve accuracy, even with no external metric.

### 5.1 LOKI Improves Zero-Shot Models

We evaluate the capability of LOKI to improve upon zero-shot models where all classes are observed.

**Setup.** Our experiment compares the zero-shot prediction performance of CLIP [32] on CIFAR-100 [20] to CLIP logits used with LOKI. First, we consider the setting in which no external metric relating the labels is available, and instead derive internal metric information from Euclidean distances between text embeddings produced by the models' respective text encoders. Second, we consider three external metric spaces for use with LOKI: the complete graph $K_{100}$ and two phylogenetic trees, the default CIFAR-100 superclass hierarchy [20] and WordNet [2].

Table 1: CIFAR-100. Improving CLIP predictions using LOKI. Results are reported as $\mathbb{E}[d^2(y, \hat{y})]$ in the respective metric space. CLIP-like zero-shot models can be improved using LOKI even without access to an external metric, in which case internal class embedding distances are used. When an external metric is available, LOKI outperforms CLIP using the default CIFAR-100 hierarchy and WordNet.

| Model | Metric Space | arg max | LOKI | Relative Improvement |
|---|---|---|---|---|
| CLIP-RN50 [32] | Internal | 0.2922 | 0.2613 | 10.57% |
| CLIP-ViT-L-14 [32] | Internal | 0.1588 | 0.1562 | 1.63% |
| ALIGN [16] | Internal | 0.1475 | 0.1430 | 3.02% |
| CLIP-RN50 [32] | $K_{100}$ | 0.5941 | 0.5941 | 0.0% |
| CLIP-RN50 [32] | Default tree | 7.3528 | 7.1888 | 2.23% |
| CLIP-RN50 [32] | WordNet tree | 24.3017 | 19.5549 | 19.53% |
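The internal-metric variant in Table 1 can be reproduced schematically as follows; we assume `class_text_emb` (one text-encoder embedding per class prompt) and `logits` (zero-shot scores over the same classes) were computed beforehand with a model of your choice, so all names and the random stand-ins below are ours.

```python
import numpy as np

def internal_metric_loki(logits, class_text_emb):
    """Re-rank zero-shot logits with a self-derived (internal) label metric.

    logits:         (N,) zero-shot scores over all N class prompts.
    class_text_emb: (N, d) text-encoder embeddings, one per class name/prompt.
    """
    # Softmax to get P_{y|x}.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Internal metric: Euclidean distances between class embeddings.
    diff = class_text_emb[:, None, :] - class_text_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)            # (N, N)
    D = -(dist ** 2)
    return int(np.argmax(D @ p))                    # LOKI prediction over all N classes

# Toy example with random stand-ins for embeddings and logits.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 512))
logits = rng.normal(size=100)
print(internal_metric_loki(logits, emb))
```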
**Results.** The results of this experiment are given in Table 1. When no external metric information is available, LOKI still outperforms CLIP-like models that use the standard prediction rule; in other words, LOKI seems to unconditionally improve CLIP. As expected, under the complete graph, our method becomes equivalent to the standard prediction mechanism used by CLIP. On the other hand, LOKI outperforms CLIP when using the default CIFAR-100 tree hierarchy and even more so when using the WordNet geometry, with a 19.53% relative improvement in mean squared distance over the CLIP baseline. We postulate that the strong performance using WordNet is due to its richer geometric structure compared to that of the default CIFAR-100 hierarchy.

### 5.2 LOKI on Partially-Observed Label Spaces

To validate our approach on partially observed label spaces, we evaluate the performance of adapting a logistic classifier trained on SimCLR embeddings of ImageNet [6], 5-NN models trained on a 9,419-class subset of the PubMed dataset (https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification), and the 325,056-class LSHTC dataset [30].

Figure 2: (Top left) ImageNet. Mean squared distances under uniform class sampling and under the Gibbs distribution; LOKI improves upon the baseline SimCLR one-vs-rest classifier. (Top right) Synthetic. Active class selection consistently leads to larger loci compared to uniform sampling. (Bottom) Active selection on ImageNet. Active class selection improves performance on ImageNet.

Table 2: PubMed. LOKI improves the baseline for all settings of $K$. The metric space is Euclidean distance applied to SimCSE embeddings.

| K | 5-NN | 5-NN + LOKI |
|---|---|---|
| 100 | 1.68666 | 1.42591 |
| 250 | 1.52374 | 1.47801 |
| 500 | 1.64074 | 1.45921 |

Table 3: LSHTC. LOKI improves the baseline for all settings of $K$. We summarize the metric space graph by generating 10,000 supernodes.

| K | 5-NN | 5-NN + LOKI |
|---|---|---|
| 50 | 1.2519 | 0.2896 |
| 100 | 1.4476 | 0.3467 |
| 150 | 1.7225 | 0.3969 |
| 200 | 1.7092 | 0.3171 |
| 250 | 1.6465 | 0.3995 |
| 300 | 1.6465 | 0.4004 |

**Setup.** For ImageNet, we use the WordNet phylogenetic tree as the metric space [2]. In this setting, we sample random subsets of size $K$ of the 1000 ImageNet classes and compare a baseline one-vs-rest classifier to the same classifier using LOKI to adapt predictions to classes beyond the original $K$. We conduct two sets of experiments. In the first, we sample $K$ classes uniformly, while in the second, we adopt a more realistic sampler: we sample from a Gibbs distribution, $P(\lambda \mid \theta) = \frac{1}{Z} \exp(-\theta\, d(\lambda, \lambda_c))$, where $\lambda_c = m_Y(\mathbb{1}_N)$ is the centroid of the metric space, $\theta$ is the concentration parameter around the centroid, and $Z$ is the normalizer. While the Gibbs distribution sampler is more realistic, it is also the more challenging setting: classes which have low probability according to this distribution are less likely to appear in the locus. For PubMed, we derive our metric from Euclidean distances between SimCSE class embeddings [11]. Finally, for LSHTC, we summarize the default graph by randomly selecting nodes and merging them with their neighbors until we obtain a graph with 10,000 supernodes representing sets of classes.
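The Gibbs class sampler from the setup is straightforward once distances to the centroid are available; the sketch below uses a synthetic balanced tree and our own choices of θ and helper names in place of the ImageNet/WordNet setup.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
T = nx.balanced_tree(3, 4)                       # synthetic label tree
nodes = list(T.nodes)
dist = dict(nx.all_pairs_shortest_path_length(T))

# Centroid lambda_c = m_Y(1_N): the node minimizing the unweighted Frechet variance.
variance = {y: sum(dist[y][v] ** 2 for v in nodes) for y in nodes}
centroid = min(variance, key=variance.get)

def sample_classes(K, theta):
    """Sample K distinct classes with P(lambda) proportional to exp(-theta * d(lambda, centroid))."""
    w = np.array([np.exp(-theta * dist[centroid][y]) for y in nodes])
    p = w / w.sum()
    idx = rng.choice(len(nodes), size=K, replace=False, p=p)
    return [nodes[i] for i in idx]

print(sample_classes(K=10, theta=0.5))           # concentrated near the centroid
```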
**Results.** Figure 2 shows the mean squared distances compared to the baseline one-vs-rest classifiers across various settings of $K$. We find that LOKI always significantly outperforms the baseline, even in the more challenging setting of sampling according to a Gibbs distribution. Tables 2 and 3 show our improvements when using LOKI over the baseline 5-NN models. While LOKI consistently yields an improvement on PubMed and LSHTC, the improvement is more dramatic on LSHTC.

### 5.3 Large Loci via Active Learning

We evaluate our active next-class selection approach on increasing the locus size. We expect that, compared to passive (random) selection, our active approach will lead to larger loci and, in practice, a larger set of possible classes that can be predicted using LOKI while observing fewer classes during training.

**Setup.** We compare with a baseline that randomly selects the next class. We first generate a synthetic random tree of size $N$ and fix an initial $K$. In active selection, we use the approach described in Theorem 4.8 to select the next node. As a passive baseline, we randomly sample (without replacement) the remaining nodes that are not in $\Lambda$.

Figure 3: CLIP on CIFAR-100 with no external metric. (Left) Reliability diagrams across a range of Softmax temperatures, highlighting the CLIP default temperature, the optimal temperature for LOKI, and the minimizer of the Expected Calibration Error (ECE). All three are well-calibrated. (Center) Tradeoff between optimizing for ECE and the expected squared distance. As with the reliability diagrams, the CLIP default temperature, the LOKI-optimal temperature, and the ECE-optimal temperature are similar. (Right) Tradeoff between optimizing for zero-one error and the expected squared distance. Temperature can be tuned to improve accuracy when using LOKI.

**Results.** The results are shown in Figure 2. We set $N = 100$ and the initial $K = 3$, then average the result over 10 independent trials. The active approach consistently outperforms the random baseline, as it attains a larger locus with fewer nodes selected.

### 5.4 Improving Performance via Active Learning

Next, we evaluate our active next-class selection approach on improving error on ImageNet. Having found that the active approach indeed leads to an increased locus size compared to the passive baseline, we expect that this increased locus size will lead to improved performance.

**Setup.** We randomly sample 500 ImageNet classes, and using the WordNet metric space, we use our active approach to iteratively sample 50 more classes. We compare this to a passive baseline in which the 50 classes are sampled randomly. We repeat this experiment over 10 independent trials.

**Results.** Figure 2 shows that our active approach yields improved error over the passive baseline. The gap between the active approach and the passive baseline widens with more selection rounds.

### 5.5 Improving Accuracy via Calibration

Finally, we evaluate the effect of calibrating the Softmax outputs on the performance of LOKI.

**Setup.** We calibrate via Softmax temperature scaling [13] using CLIP on CIFAR-100. We do not use an external metric space; instead, we use Euclidean distance applied to CLIP text encoder embeddings.
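Temperature scaling composes naturally with LOKI, since only the probability vector changes before the fixed linear map is applied. A minimal sketch, assuming the logits and the negative-squared-distance matrix `D` are already available (the function and the random stand-ins below are ours):

```python
import numpy as np

def loki_with_temperature(logits, D, temperature=1.0):
    """Apply Softmax temperature scaling, then the LOKI linear map D (N x K)."""
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(np.argmax(D @ p))

# Sweeping the temperature lets one trade off calibration (ECE), zero-one error,
# and the expected squared distance, as in Figure 3.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))
D = -((emb[:, None, :] - emb[None, :, :]) ** 2).sum(axis=-1)
logits = rng.normal(size=100)
for t in (0.5, 1.0, 2.0):
    print(t, loki_with_temperature(logits, D, temperature=t))
```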
**Results.** The reliability diagram in Figure 3 shows that the optimal Softmax temperature for LOKI is both close to the default temperature used by CLIP and to the optimally-calibrated temperature. In Figure 3 (right), we find that appropriate tuning of the temperature parameter can lead to improved accuracy with CLIP, even when no external metric space is available.

## 6 Conclusion

In this work, we proposed LOKI, a simple adaptor for pretrained models that enables the prediction of additional classes unobserved during training by using metric space information. We comprehensively answered the space of new questions that arise under LOKI in terms of learning, optimal metric space settings, and a practical active selection strategy. Experimentally, we showed that LOKI improves CLIP even without external metric space information and predicts a large number of unobserved classes, and we validated our active selection strategy.

**Acknowledgements.** We are grateful for the support of the NSF under CCF2106707 (Program Synthesis for Weak Supervision) and the Wisconsin Alumni Research Foundation (WARF).

## References

[1] Gökhan Bakir, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskar, and S.V.N. Vishwanathan. Predicting Structured Data. The MIT Press, July 2007.
[2] Björn Barz and Joachim Denzler. Hierarchy-based image embeddings for semantic image retrieval. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 638–647. IEEE, 2019.
[3] Louis J. Billera, Susan P. Holmes, and Karen Vogtmann. Geometry of the space of phylogenetic trees. Advances in Applied Mathematics, 27(4):733–767, 2001.
[4] Yeshwanth Cherapanamjeri, Nicolas Flammarion, and Peter L. Bartlett. Fast mean estimation with sub-Gaussian rates. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 786–806. PMLR, 2019.
[5] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems 30 (NIPS 2016), volume 30, 2016.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[7] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785, 2009.
[8] Maurice R. Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l'institut Henri Poincaré, 1948.
[9] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
[10] Junyu Gao, Tianzhu Zhang, and Changsheng Xu. I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):8303–8311, July 2019.
[11] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Empirical Methods in Natural Language Processing (EMNLP), 2021.
[12] Colin Graber and Alexander Schwing. Graph structured prediction energy networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[15] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1–40, 2016.
[16] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 4904–4916. PMLR, 2021.
[17] Michael C. Kampffmeyer, Yinbo Chen, Xiaodan Liang, Hao Wang, Yujia Zhang, and Eric P. Xing. Rethinking knowledge graph propagation for zero-shot learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11479–11488, 2019.
[18] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
[19] Anna Korba, Alexandre Garcia, and Florence d'Alché-Buc. A structured prediction approach for label ranking. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[21] Volodymyr Kuleshov and Percy S. Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
[22] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958, 2009.
[23] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
[24] M. Lerasle and R. I. Oliveira. Robust empirical mean estimators, 2011.
[25] Juan Li, Ruoxu Wang, Ningyu Zhang, Wen Zhang, Fan Yang, and Huajun Chen. Logic-guided semantic representation learning for zero-shot relation classification. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2967–2978, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics.
[26] Aaron Lou, Isay Katsman, Qingxuan Jiang, Serge Belongie, Ser-Nam Lim, and Christopher De Sa. Differentiating through the Fréchet mean. In Proceedings of the 37th International Conference on Machine Learning, pages 6393–6403, 2020.
[27] Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.
[28] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
[29] Tom M. W. Nye, Xiaoxian Tang, Grady Weyenberg, and Ruriko Yoshida. Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees. Biometrika, 104(4):901–922, December 2017.
[30] Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artières, Georgios Paliouras, Éric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Gallinari. LSHTC: A benchmark for large-scale text classification. arXiv, abs/1503.08581, 2015.
[31] Alexander Petersen and Hans-Georg Müller. Fréchet regression for random objects with Euclidean predictors. The Annals of Statistics, 2016.
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[33] Tim Rocktäschel, Sameer Singh, and Sebastian Riedel. Injecting logical background knowledge into embeddings for relation extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1119–1129, Denver, Colorado, May–June 2015. Association for Computational Linguistics.
[34] Abhinaba Roy, Deepanway Ghosal, Erik Cambria, Navonil Majumder, Rada Mihalcea, and Soujanya Poria. Improving zero-shot learning baselines with commonsense knowledge. Cognitive Computation, 14(6):2212–2222, November 2022.
[35] Alessandro Rudi, Carlo Ciliberto, Gian Maria Marconi, and Lorenzo Rosasco. Manifold structured prediction. In Advances in Neural Information Processing Systems 32 (NeurIPS 2018), volume 32, 2018.
[36] Alessandro Rudi, Carlo Ciliberto, Gian Maria Marconi, and Lorenzo Rosasco. Manifold structured prediction. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
[37] Changho Shin, Winfred Li, Harit Vishwakarma, Nicholas Carl Roberts, and Frederic Sala. Universalizing weak supervision. In International Conference on Learning Representations, 2022.
[38] Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
[39] Russell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain knowledge. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[40] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, 2010.
[41] Harit Vishwakarma and Frederic Sala. Lifting weak supervision to structured prediction. In Advances in Neural Information Processing Systems, 2022.
[42] X. Wang, Y. Ye, and A. Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6857–6866, Los Alamitos, CA, USA, June 2018. IEEE Computer Society.
The appendix is organized as follows: in Appendix A, we provide proofs of the claims made in the main text. Next, in Appendix B, we provide algorithms and time complexity analyses for the algorithms referenced in the main text. Appendix C contains additional experimental results. In Appendix D, we provide additional details about our experimental setup. Then, in Appendix E, we discuss the broader impacts and limitations of LOKI. Finally, in Appendix F, we provide more details and examples of the locus visualizations shown in Figure 1.

## A Deferred Proofs

### A.1 Proof of Theorem 4.1 (LOKI sample complexity)

In this section, we provide a formal statement and proof of our sample complexity result for LOKI, including additional required definitions and assumptions. First, we define the required tools to prove the sample complexity bound for LOKI. For the purposes of this proof, we define the Fréchet variance as the function over which the Fréchet mean returns the minimizer.

**Definition A.1 (Fréchet variance).** The Fréchet variance is defined as
$$\Psi_w(V) := \sum_{i \in [K]} w_i\, d^2(V, V_i).$$

Additionally, we will require a technical assumption related to the sensitivity of the Fréchet variance to different node choices.

**Assumption A.2 (Fréchet variance is $\tfrac{1}{\alpha}$-bi-Lipschitz).** For a metric space defined by a graph $G = (V, E)$, the Fréchet variance is $K$-bi-Lipschitz if there exists a $K \geq 1$ such that
$$\frac{1}{K}\, d^2(V, V') \leq |\Psi_w(V) - \Psi_w(V')| \leq K\, d^2(V, V')$$
for all $V, V' \in \mathcal{V}$ and a fixed $w \in \Delta^{K-1}$. For our purposes, such a $K$ always exists: consider setting $K = \operatorname{diam}(G)^2 \max_{V_1, V_2 \in \mathcal{V}} |\Psi_w(V_1) - \Psi_w(V_2)|$. However, this is a very conservative bound that holds for all graphs that we consider. Instead, we assume access to the largest $\alpha = \frac{1}{K} \leq 1$ such that $\alpha\, d^2(V, V') \leq |\Psi_w(V) - \Psi_w(V')|$, which may be problem dependent.

**Theorem A.3 (LOKI sample complexity).** Let $Y$ be a set of points on the $d$-dimensional 2-norm ball of radius $R$, and let $\Lambda \subseteq Y$ be the set of $K$ observed classes. Assume that $\Lambda$ forms a $2R/(\sqrt[d]{K}-1)$-net under the Euclidean distance. Assume that training examples are generated by drawing $n$ samples from the following process: draw $x \sim N(y, I)$ where $y \sim \operatorname{Unif}(\Lambda)$, and at test time, draw $x \sim N(y, I)$ where $y \sim \operatorname{Unif}(Y)$. Assume that we estimate a Gaussian mixture model with $K$ components (each having identity covariance) on the training set and obtain probability estimates $\hat{P}(y_i|x)$ for $i \in [K]$ for a sample $(x, y^*)$ from the test distribution. Then with high probability, under the following model,
$$\hat{y}^* \in m_\Lambda\big([\hat{P}(y_i|x)]_{i \in [K]}\big) = \arg\min_{y \in Y} \sum_{i=1}^{K} \hat{P}(y_i|x)\, d^2(y, y_i),$$
the sample complexity of estimating the target $y^*$ from the test distribution $D_{\text{test}}$ with prediction $\hat{y}^*$ is
$$\mathbb{E}_{(x,y^*) \sim D_{\text{test}}}\big[d^2(y^*, \hat{y}^*)\big] \leq O(\,\cdot\,),$$
where $d$ is the dimensionality of the input and $\alpha$ is our parameter under Assumption A.2.

*Proof.* We begin by detailing the data-generating process. At training time, our underlying data-generating process, $D_{\text{train}}$, is as follows:
- We begin with a $2R/(\sqrt[d]{K}-1)$-net, $Y$, on the $d$-dimensional 2-norm ball of radius $R$ under the Euclidean distance, and let $\Lambda \subseteq Y$.
- Draw $y \sim \operatorname{Unif}(Y)$. Discard draws if $y \notin \Lambda$.
- Draw $x \sim N(y, I)$.

At test time, we do not discard draws and allow for classes not in $\Lambda$, using the following data-generating process, $D_{\text{test}}$:
- We begin with a $2R/(\sqrt[d]{K}-1)$-net, $Y$, on the $d$-dimensional 2-norm ball of radius $R$ under the Euclidean distance, and let $\Lambda \subseteq Y$.
- Draw $y \sim \operatorname{Unif}(Y)$.
- Draw $x \sim N(y, I)$.

Given a labeled training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ containing $n$ points drawn from $D_{\text{train}}$ with $|\Lambda| = K$ distinct classes, we would like to fit a $K$-component Gaussian mixture model with identity covariance.
We first perform mean estimation of each of the classes separately using the median-ofmeans estimator [15, 27, 24]. Using the estimator yields the following parameter estimation bound [27, 4]: ||yi ˆyi||2 O with probability 1 δ. Next, we consider the relationships between four quantities: ΨP(y ), ΨP(ˆy ), Ψb P(y ), Ψb P(ˆy ), where P is the vector of probabilities from the true Gaussian mixture, b P is the vector of probabilities from the estimated model, y mΛ(P) is the target class, and ˆy mΛ(b P) is the predicted class. While it is problem-dependent as to whether ΨP(y ) Ψb P(ˆy ) or ΨP(y ) > Ψb P(ˆy ), a similar argument holds for both cases. So without loss of generality, we assume that ΨP(y ) Ψb P(ˆy ). Then by the definition of the Fréchet mean, the following inequalities hold: ΨP(y ) Ψb P(ˆy ) Ψb P(y ), and consequently, Ψb P(ˆy ) ΨP(y ) Ψb P(y ) ΨP(y ). (2) We proceed by obtaining upper and lower bounds of the existing bounds in Equation 2. First, we will obtain an upper bound on Ψb P(y ) ΨP(y ). |Ψb P(y ) ΨP(y )| = i [K] (b P P)d2(y , Vi) b P(yi|x) P(yi|x) ||y yi||2 2 b P(x|yi)P(yi) P j [K] b P(x|yj)P(yj) P(x|yi)P(yi) P j [K] P(x|yj)P(yj) ||y yi||2 2 2||x ˆyi||2 2} 1 j [K] exp{ 1 2||x ˆyj||2 2} 1 2||x yi||2 2} 1 K P j [K] exp{ 1 2||x yj||2 2} 1 ||y yi||2 2 2||x ˆyi||2 2} P j [K] exp{ 1 2||x ˆyj||2 2} 2||x yi||2 2} P j [K] exp{ 1 2||x yj||2 2} ||y yi||2 2 For notational convenience, we define the following: ai := exp{ 1 2||x yi||2 2}, ˆai := exp{ 1 2||x ˆyi||2 2}, b := P j [K] exp{ 1 2||x yj||2 2}, ˆb := P j [K] exp{ 1 2||x ˆyj||2 2}, and ci := ||ˆy yi||2 2. Then (4) becomes: Now we define the following: L := ||x yz||2 2 with z arg minj ||x yj||2 2 is the smallest distance to a class mean, and L + Ei := ||x yi||2 2 with Ei > 0. Similarly, define ˆL := ||x ˆyz||2 2 with z arg minj ||x ˆyj||2 2 is the smallest distance to an estimated class mean, and ˆL+ ˆEi := ||x ˆyi||2 2 with ˆEi > 0. Finally, we define T := p ˆL + pˆL + ˆEi + L + L + Ei. Then we can bound the parts separately: For i = z, we have 2||x yi||2 2} P j [K] exp{ 1 2||x yj||2 2} 2(L + Ei)} exp{ 1 2L} + exp{ 1 and in the case of i = z, we bound az b 1. So overall, we have ai b 1 exp{1i =z 1 2 Ei} for all i [K]. ci = ||y yi||2 2 2||x yi||2 2 + 2||x y ||2 2 = 2(L + Ei + ||x y ||2 2). (7) ˆai ai 1 = exp 1 2||x yi||2 2 1 2||x ˆyi||2 2 2||x yi||2 2 1 2||x ˆyi||2 2 1 2 ˆyi yi, 2x yi ˆyi 2||ˆyi yi|| ||2x yi ˆyi|| ||ˆyi yi|| (||x yi|| + ||x ˆyi||) = ||ˆyi yi|| (||x yi|| + ||x yi + yi ˆyi||) ||ˆyi yi|| (2||x yi|| + ||yi ˆyi||) = ||ˆyi yi|| 2 p L + Ei + ||yi ˆyi|| . (9) Next, we must control Ei ˆEi in order to obtain the bound for ˆai Ei ˆEi = (||x yi||2 L) (||x ˆyi||2 ˆL) = ||x yi||2 ||x ˆyi||2 + ||x ˆyz||2 ||x yz||2 = ˆyi yi, 2x yi ˆyi + yz ˆyz, 2x ˆyz yz ||ˆyi yi|| ||2x yi ˆyi|| + ||yz ˆyz|| ||2x ˆyz yz|| ||yi ˆyi|| (||x ˆyi|| + ||x yi||) + ||ˆyz yz|| (||x yz|| + ||x ˆyz||) n (||x ˆyi|| + ||x yi|| + ||x yz|| + ||x ˆyz||) Using (6), we obtain exp{1i =z 1 exp 1i =z max 0, 1 Plugging (6), (7), (9), and (10) into (5), we obtain ||ˆyi yi|| 2 L + Ei + ||yi ˆyi|| exp{1i =z 1 2Ei} 2(L + Ei + ||x y ||2 2) ||ˆyi yi|| 2 L + Ei + ||yi ˆyi|| exp{1i =z 1 2(L + Ei + ||x y ||2 2) exp 1i =z max 0, 1 L + Ei + ||x y ||2 2 exp 1i =z 1 Next, recalling the fact that our class means form an ε-net on the radius-R ball, we use Lemma 5.2 from [40] to bound L as L 2R d K + Ei + ||x y ||2 2 exp 1i =z 1 Next, we will consider cases on Ei. We will consider the cases in which Ei < log R2 and Ei log R2. 
We will begin with the case in which Ei < log R2, then Equation 11 becomes: d K + ||x y ||2 2 Next, the case in which Ei log R2, then Equation 11 becomes: d K + 2 log R + ||x y ||2 2 R d K + ||x y ||2 2 Setting M K to be the number of terms for which Ei < log R2 holds and by combining the two bounds, we obtain the following: + K||x y ||2 2 Ultimately, our bound is |Ψb P(y ) ΨP(y )| O + K||x y ||2 2 Next, we will obtain a lower bound on Ψb P(ˆy ) ΨP(y ). |Ψb P(ˆy ) ΨP(y )| |ΨP(y ) ΨP(ˆy )| |ΨˆP(ˆy ) ΨP(ˆy )| triangle inequality αd2(y , ˆy ) |Ψb P(y ) ΨP(y )| Assumption A.2. Combining both of these bounds with Equation 2, we obtain αd2(y , ˆy ) |Ψb P(y ) ΨP(y )| Ψb P(ˆy ) ΨP(y ) Ψb P(y ) ΨP(y ) |Ψb P(y ) ΨP(y )| αd2(y , ˆy ) 2|Ψb P(y ) ΨP(y )| d2(y , ˆy ) O d K + K||x y ||2 2 E(x,y ) Dtest[d2(y , ˆy )] O d K + d K ! Next, we obtain a bound on M. Since M K is the number of terms for which Ei < log R2, we have that M = V K, where V [0, 1] is the ratio of the volumes of the ball for which Ei < log R2, and the ball containing the entire set of classes Y. We have Ei = x yi 2 2 x yz 2 2 < log R2 x yi 2 2 < log R2 + L log R2 + 2R log R2 + 2R which is an upper bound of the radius of the ball for which Ei < log R2 is true. Now we construct V : log R2 + 2R Combining this with our error bound, we have E(x,y ) Dtest[d2(y , ˆy )] However, we can choose K to weaken our dependence on R. log R2 + 2R Finally, we have E(x,y ) Dtest[d2(y , ˆy )] O which completes the proof. Discussion of Theorem A.3. The above bound contains several standard quantities, including the dimensionality of the inputs d, the radius R of the ball from which labels are drawn, the number of classes and samples used to estimate the model K and n respectively, and δ, which is used to control the probability that the bound holds. Naturally, the bound scales up with higher d and K, and improves with an increased number of samples n. The dependence on R is also related to its dependence on d so long as we have that K 2( d log R/R)) 1 + 1 and for d 2, R increases, which improves the bound. The bound includes a metric space dependent quantity α, which is related to the amount that the Fréchet variance can change subject to changes in its argument, i.e., which node is considered. A.2 Proof of Theorem A.4 (minimum locus cover for trees) Theorem A.4 (Minimum locus cover for trees). The minimum locus cover for a tree graph T is Λ = Leaves(T). Proof. We would like to show to show two things: 1. any node can be resolved by an appropriate construction of w using only the leaves and 2. removing any leaf makes that leaf node unreachable, no matter what the other nodes are in y i.e., we must use at least all of the leaf nodes. If both of these conditions hold (i.e., that the leaves form a locus cover and that they are the smallest such set), then the the leaves of any tree form a minimum locus cover. To set up the problem, we begin with some undirected tree, T, whose nodes are our set of labels: T = (Y, E) where Λ = Leaves(T) Y is the set of leaf nodes. Moreover, let Λ be a tuple of the leaf nodes. We will start by proving that Λ is a locus cover. We will show this by cases on v Y, the node that we would like to resolve: 1. If v Λ, i.e. v is a leaf node, then setting wi = 1{Λi = v} yields mΛ(w) = arg min y Y i=1 1{Λi = v}d2(y, Λi) = arg min y Y d2(y, v) = {v}. 2. If v Λ, we have a bit more work to do. Since v is an internal node, consider any pair of leaves, l1 = l2 such that v is along the (unique) path between l1 and l2: v Γ(l1, l2). 
Then set
$$w_i = \begin{cases} \frac{d(v, l_2)}{d(l_1, l_2)} & \text{if } \Lambda_i = l_1, \\ \frac{d(v, l_1)}{d(l_1, l_2)} & \text{if } \Lambda_i = l_2, \\ 0 & \text{otherwise,} \end{cases}$$
which yields
$$m_\Lambda(w) = \arg\min_{y \in \mathcal{Y}}\ \frac{d(v, l_2)}{d(l_1, l_2)}\, d^2(y, l_1) + \frac{d(v, l_1)}{d(l_1, l_2)}\, d^2(y, l_2) = \arg\min_{y \in \mathcal{Y}}\ d(v, l_2)\, d^2(y, l_1) + d(v, l_1)\, d^2(y, l_2) = \{v\}.$$
Thus $\Lambda$ is a locus cover. Next, we show that it is the smallest such set. Let $l \in \Lambda$ be any leaf node, and define $\Lambda' = \Lambda \setminus \{l\}$. We must show that $\Lambda'$ is not a locus cover. Assume for contradiction that $\Lambda'$ is a locus cover. This means that, given some tuple $y$ whose entries are the elements of $\Lambda'$, we have $\Pi(\Lambda') = \mathcal{Y}$. This implies that the path between any two nodes in $\Lambda'$ is also contained in $\Pi(\Lambda')$. Since the leaves form a locus cover and any $y \in \mathcal{Y}$ can be constructed by an appropriate choice of $w$ along with a path between two leaf nodes, $l$ must either be one of the entries of $\Lambda'$ (one of the endpoints of the path) or it must be an internal node. Both cases are contradictions: $\Lambda'$ cannot include $l$ by assumption, and $l$ is assumed to be a leaf node. It follows that $\Lambda$ is a minimum locus cover.

A.3 Proof of Lemma A.5 (tree pairwise decomposability)

Lemma A.5 (Tree pairwise decomposability). Let $T = (\mathcal{Y}, E)$ be a tree and $\Lambda \subseteq \mathcal{Y}$. Then $\Pi(\Lambda)$ is pairwise decomposable.

Proof. Assume for contradiction that there exists $y^\star \in \mathcal{Y}$ such that $y^\star \notin \Pi(\{\lambda_i, \lambda_j\})$ for all $\lambda_i, \lambda_j \in \Lambda$, but $y^\star \in \Pi(\Lambda)$. Then $y^\star \in m_\Lambda(w)$ for some $w$. Note that $y^\star \notin \Pi(\{\lambda_i, \lambda_j\})$ implies $y^\star \notin \Gamma(\lambda_i, \lambda_j)$ for all $\lambda_i, \lambda_j \in \Lambda$, because $\Pi(\{\lambda_i, \lambda_j\}) = \Gamma(\lambda_i, \lambda_j)$ due to the uniqueness of paths in trees. As such, $\bigcup_{\lambda \in \Lambda} \Gamma(y^\star, \lambda)$ must contain an immediate relative $y'$ of $y^\star$ for which
$$\sum_{i=1}^{|\Lambda|} w_i\, d^2(y^\star, \lambda_i) = \sum_{i=1}^{|\Lambda|} w_i\, \big(d(y', \lambda_i) + 1\big)^2 > \sum_{i=1}^{|\Lambda|} w_i\, d^2(y', \lambda_i).$$
Therefore, $y^\star \notin \Pi(\Lambda)$, because $y^\star$ is not a minimizer of $\sum_{i=1}^{|\Lambda|} w_i\, d^2(y, \lambda_i)$. This is a contradiction, so it must be that $y^\star \in \Pi(\{\lambda_i, \lambda_j\})$ for some $\lambda_i, \lambda_j \in \Lambda$ whenever $y^\star \in \Pi(\Lambda)$.

A.4 Proof of Theorem A.6 (Algorithm B.1 correctness)

Theorem A.6 (Algorithm B.1 correctness). Algorithm B.1 returns a locus cover for phylogenetic trees.

Proof. We first prove that Algorithm B.1 halts; then, by its stopping criterion, it is guaranteed to return a locus cover. Suppose the algorithm keeps running until it reaches the case in which all of the leaf nodes are included in $\Lambda$ (i.e., $\Lambda = \mathcal{Y}$); this is the trivial locus cover of the phylogenetic tree, and the algorithm halts.

A.5 Proof of Lemma A.7 (phylogenetic tree pairwise decomposability)

Lemma A.7 (Phylogenetic tree pairwise decomposability). Let $T = (V, E)$ be a tree with $\mathcal{Y} = \mathrm{Leaves}(T)$ and $\Lambda \subseteq \mathcal{Y}$. Then $\Pi(\Lambda)$ is pairwise decomposable.

Proof. Suppose, to reach a contradiction, that there exists $y^\star \in \mathcal{Y}$ such that $y^\star \in \Pi(\Lambda)$ but $y^\star \notin \Pi(\{\lambda_i, \lambda_j\})$ for all $\lambda_i, \lambda_j \in \Lambda$. In other words, $y^\star$ is a Fréchet mean for some weighting on $\Lambda$, but $y^\star$ is not a Fréchet mean for any weighting of any two vertices within $\Lambda$. For arbitrary $\lambda_i, \lambda_j \in \Lambda$, define the closest vertex to $y^\star$ on the path between $\lambda_i$ and $\lambda_j$ as
$$v^\star(\lambda_i, \lambda_j) = \arg\min_{v \in \Gamma(\lambda_i, \lambda_j)} d(y^\star, v).$$
Also, let $\{\lambda_1, \lambda_2\} = \arg\min_{\lambda_i, \lambda_j \in \Lambda} d\big(y^\star, v^\star(\lambda_i, \lambda_j)\big)$, the pair of vertices in $\Lambda$ whose path $\Gamma(\lambda_1, \lambda_2)$ passes closest to $y^\star$. To avoid notational clutter, define $v^\star = v^\star(\lambda_1, \lambda_2)$. Thus, $v^\star$ represents the closest vertex to $y^\star$ lying on the path between any two $\lambda_i, \lambda_j \in \Lambda$. Because $v^\star$ lies on the path $\Gamma(\lambda_1, \lambda_2)$, there exists some weighting $w$ for which
$$v^\star = \arg\min_{v \in \Gamma(\lambda_1, \lambda_2)} \big(w_1\, d^2(v, \lambda_1) + w_2\, d^2(v, \lambda_2)\big).$$
For this $w$, we have that $m_{\{\lambda_1, \lambda_2\}}(w) = \arg\min_{y \in \mathcal{Y}} d(y, v^\star)$, the set of leaf nodes of smallest distance from $v^\star$. By the initial assumption, $y^\star \notin \Pi(\{\lambda_1, \lambda_2\})$, so $y^\star \notin m_{\{\lambda_1, \lambda_2\}}(w)$. Let $y'$ be a vertex of $m_{\{\lambda_1, \lambda_2\}}(w)$ that is closest to $y^\star$. Clearly, $d(y', v^\star)^2 < d(y^\star, v^\star)^2$.
For any $v \in V$, respectively define the sets of vertices (1) shared in common by the two paths $\Gamma(v, y^\star)$ and $\Gamma(v, y')$, (2) on the path $\Gamma(v, y^\star)$ but not on $\Gamma(v, y')$, and (3) on the path $\Gamma(v, y')$ but not on $\Gamma(v, y^\star)$ as
$$d_C(v) = \Gamma(v, y^\star) \cap \Gamma(v, y'), \qquad d_{y^\star}(v) = \Gamma(v, y^\star) \setminus \Gamma(v, y'), \qquad d_{y'}(v) = \Gamma(v, y') \setminus \Gamma(v, y^\star).$$
We have that $|d_{y^\star}(v)| + |d_{y'}(v)| + 1 = |\Gamma(y^\star, y')|$ (the difference of 1 accounts for the final vertex shared in common by both paths), and also that $|d_{y'}(v^\star)| < |d_{y^\star}(v^\star)|$, since $y'$ is closer to $v^\star$ than $y^\star$ is.

Suppose there exist $\lambda_3, \lambda_4 \in \Lambda$ yielding $u^\star = v^\star(\lambda_3, \lambda_4) \in \Gamma(y^\star, y')$ such that $d^2(y^\star, u^\star) \leq d^2(y', u^\star)$. That is, $u^\star$ is defined analogously to $v^\star$, with the additional restrictions that $u^\star$ lie on the path $\Gamma(y^\star, y')$ and that $u^\star$ be at least as close to $y^\star$ as to $y'$. Then $|d_{y^\star}(u^\star)| < |d_{y^\star}(v^\star)|$, implying $d^2(y^\star, u^\star) < d^2(y^\star, v^\star)$, which contradicts the definition of $v^\star$.

Suppose now that there exist $\lambda_3, \lambda_4 \in \Lambda$ yielding $u^\star = v^\star(\lambda_3, \lambda_4) \notin \Gamma(y^\star, y')$ such that $d^2(y^\star, u^\star) \leq d^2(y', u^\star)$. That is, $u^\star$ is again defined analogously to $v^\star$, but $u^\star$ is not on $\Gamma(y^\star, y')$ and is at least as close to $y^\star$ as to $y'$. Then either $\bigcap_{\lambda \in \Lambda} d_C(\lambda) \neq \emptyset$ (i.e., all lambdas lie in the same branch off of $\Gamma(y^\star, y')$), in which case the path between any two lambdas has the same closest node $v^\star$ to $y^\star$, or it is possible to choose $\lambda_5 \in \Lambda$ such that $d_C(\lambda_3) \cap d_C(\lambda_5) = \emptyset$. In this situation, it must be that $\Gamma(\lambda_3, \lambda_5) \cap \Gamma(y^\star, y') \neq \emptyset$, which implies there exists a vertex $t \in \Gamma(\lambda_3, \lambda_5)$ such that $d^2(y^\star, t) < d^2(y^\star, u^\star)$, violating the definition of $u^\star$.

Altogether, for all $v \in \Gamma(\lambda_i, \lambda_j)$ with arbitrary $\lambda_i, \lambda_j \in \Lambda$, we have that $d^2(y', v) < d^2(y^\star, v)$. This implies that for all $\lambda \in \Lambda$ there exists a $y' \in \mathcal{Y}$ such that $d^2(y', \lambda) < d^2(y^\star, \lambda)$, and therefore that $y^\star \notin \Pi(\Lambda)$, which contradicts $y^\star \in \Pi(\Lambda)$. It must be that whenever $y^\star \in \Pi(\Lambda)$, we have $y^\star \in \Pi(\{\lambda_i, \lambda_j\})$ for some $\lambda_i, \lambda_j \in \Lambda$.

A.6 Proof of Theorem A.8 (minimum locus cover for grid graphs)

Theorem A.8 (Minimum locus cover for grid graphs). The minimum locus cover for a grid graph is the pair of vertices in the furthest opposite corners.

Proof. We would like to show two things: (1) any vertex can be resolved by a construction of $w$ using the furthest pair of vertices (i.e., opposite corners), and (2) there does not exist a locus cover of size one for grid graphs with more than one vertex. Let $G = (V, E)$ be a grid graph and let $\Lambda = \{\lambda_1, \lambda_2\}$ be a set of two vertices in opposite corners of $G$ (i.e., peripheral vertices achieving the diameter of $G$). For notational convenience, also write $\Lambda = (\lambda_1, \lambda_2)$ as a tuple. We can reach any vertex $v$ by an appropriate setting of the weight vector $w = [w_1, w_2]$: set
$$w = \Big[\frac{d(v, \lambda_2)}{d(\lambda_1, \lambda_2)},\ \frac{d(v, \lambda_1)}{d(\lambda_1, \lambda_2)}\Big].$$
Then the Fréchet mean is given by
$$m_\Lambda(w) = \arg\min_{y \in \mathcal{Y}}\ \frac{d(v, \lambda_2)}{d(\lambda_1, \lambda_2)}\, d^2(y, \lambda_1) + \frac{d(v, \lambda_1)}{d(\lambda_1, \lambda_2)}\, d^2(y, \lambda_2) = \arg\min_{y \in \mathcal{Y}}\ d(v, \lambda_2)\, d^2(y, \lambda_1) + d(v, \lambda_1)\, d^2(y, \lambda_2),$$
which contains $v$. Hence $\Lambda$ is a locus cover. We can also see that $\Lambda$ is a minimum locus cover: the only way to obtain a smaller set $\Lambda'$ is for it to contain only a single vertex. However, if $\Lambda' = \{v'\}$ contains only a single vertex, it cannot be a locus cover so long as $G$ contains more than one vertex, since $v'$ is always the unique minimizer of the Fréchet mean under $\Lambda'$, which misses all other vertices in $G$. Thus $\Lambda$ is a minimum locus cover.

A.7 Proof of Lemma A.9 (loci of grid subspaces)

Lemma A.9 (Loci of grid subspaces). Given any pair of vertices $\lambda_i, \lambda_j \in \Lambda$, we can find a subgrid $G'$ of the original grid graph $G = (\mathcal{Y}, E)$ that has the given pair as two opposite corners, and $\Pi(\{\lambda_i, \lambda_j\})$ equals the set of all vertices of $G'$.

Proof. The result follows from applying Theorem A.8 to this choice of metric subspace.
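To make the two-corner construction in the proof of Theorem A.8 concrete, the following minimal sketch checks it numerically on a small grid. It is our own illustration rather than the paper's released code, and it assumes networkx is available; frechet_mean is a hypothetical brute-force helper that simply enumerates all vertices.

```python
import networkx as nx

def frechet_mean(G, anchors, weights, dist, tol=1e-9):
    """Return the set of vertices minimizing sum_i w_i * d(y, anchor_i)^2."""
    def objective(y):
        return sum(w * dist[a][y] ** 2 for a, w in zip(anchors, weights))
    best = min(objective(y) for y in G.nodes)
    return {y for y in G.nodes if objective(y) <= best + tol}

# Small grid graph; the corners (0, 0) and (rows-1, cols-1) achieve the diameter.
rows, cols = 4, 5
G = nx.grid_2d_graph(rows, cols)
dist = dict(nx.all_pairs_shortest_path_length(G))
lam1, lam2 = (0, 0), (rows - 1, cols - 1)
D = dist[lam1][lam2]

# Every vertex v should appear in the Frechet mean for the weights from the proof,
# namely w = [d(v, lam2)/d(lam1, lam2), d(v, lam1)/d(lam1, lam2)].
for v in G.nodes:
    w = [dist[v][lam2] / D, dist[v][lam1] / D]
    assert v in frechet_mean(G, [lam1, lam2], w, dist), v
print("every vertex is resolved by the two opposite corners")
```

As in the proof, $v$ need only be contained in the Fréchet mean set; on a grid, other vertices equidistant from the corners can be tied minimizers.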
A.8 Proof of Lemma A.10 (grid pairwise decomposability)

Lemma A.10 (Grid pairwise decomposability). Let $G = (\mathcal{Y}, E)$ be a grid graph and $\Lambda \subseteq \mathcal{Y}$. Then $\Pi(\Lambda)$ is pairwise decomposable.

Proof. Suppose we have a grid graph $G = (\mathcal{Y}, E)$ and a vertex $\hat{y} \in \Pi(\Lambda)$ with $\Lambda \subseteq \mathcal{Y}$. Because $\hat{y} \in \Pi(\Lambda)$ and $G$ is a grid graph, we have that $\hat{y} \in \Gamma(\lambda_\alpha, \lambda_\beta)$ for some $\lambda_\alpha, \lambda_\beta \in \Lambda$. Then, by Lemma A.9, $\hat{y} \in \Pi(\{\lambda_\alpha, \lambda_\beta\}) \subseteq \bigcup_{\lambda_i, \lambda_j \in \Lambda} \Pi(\{\lambda_i, \lambda_j\})$. Therefore, loci on grid graphs are pairwise decomposable.

A.9 Proof of Theorem A.11 (no nontrivial locus covers for complete graphs)

Theorem A.11 (Trivial locus cover for the complete graph). There is no nontrivial locus cover for the complete graph.

Proof. We show that there is no nontrivial locus cover for complete graphs by showing that removing any vertex from the trivial locus cover renders that vertex unreachable. We proceed by strong induction on the number of vertices, $n$.

Base case: We first set $n = 3$. Let $K_3$ be the complete graph on three vertices $V = \{v_1, v_2, v_3\}$, and without loss of generality let $\Lambda = \{v_2, v_3\}$ be our set of observed classes. There are two cases on the weight vector $w = [w_2, w_3]$:

Case 1: Suppose $w$ lies on the boundary of the weight simplex. This means that either $w_2 = 1$ or $w_3 = 1$, which leads to the Fréchet mean being either $v_2$ or $v_3$, respectively. Neither of these instances corresponds to $v_1$ being a minimizer.

Case 2: Suppose $w$ lies in the interior of the weight simplex. Then the Fréchet mean is given by
$$m_\Lambda(w) = \arg\min_{y \in \mathcal{Y}}\ w_2\, d^2(y, v_2) + w_3\, d^2(y, v_3).$$
Assume for contradiction that $v_1 \in \Pi(\Lambda)$. Then
$$w_2\, d^2(v_1, v_2) + w_3\, d^2(v_1, v_3) = w_2 + w_3 > w_3 = w_2\, d^2(v_2, v_2) + w_3\, d^2(v_2, v_3),$$
where the strict inequality holds because $w$ lies in the interior of the simplex. Therefore, $v_1 \notin \Pi(\Lambda)$. This is a contradiction. Thus there is no nontrivial locus cover for $K_3$.

Inductive step: Let $K_{n-1}$ be the complete graph on $n - 1$ vertices. Assume that there is no nontrivial locus cover for $K_{n-1}$. We will show that there is no nontrivial locus cover for $K_n$. Let the weight vector be $w = [w_1, \ldots, w_{n-1}]$ corresponding to vertices $v_1, \ldots, v_{n-1} \in V$ with $\Lambda = \{v_1, \ldots, v_{n-1}\}$. We want to show that $v_n \notin \Pi(\Lambda)$. We proceed by cases on $w$.

Case 1: Suppose $w$ lies on the boundary of the weight simplex, with $m \geq 1$ entries of $w$ equal to zero. Then, by strong induction, $v_n \notin \Pi(\{v_i\}_{i=1}^{n-m})$ (relabeling so that the vertices with nonzero weight come first); i.e., there is no nontrivial locus cover for $K_{n-m}$.

Case 2: Suppose $w$ lies in the interior of the weight simplex. Assume for contradiction that $v_n \in \Pi(\Lambda)$. Using this assumption, we obtain
$$\sum_{i=1}^{n-1} w_i\, d^2(v_n, v_i) = \sum_{i=1}^{n-1} w_i > \sum_{i=1}^{n-1} w_i\, d^2(v_{n-1}, v_i),$$
where the strict inequality holds because $w$ lies in the interior of the simplex. Therefore, $v_n$ is not a minimizer, so $v_n \notin \Pi(\Lambda)$. This is a contradiction; hence $K_n$ has no nontrivial locus cover.

A.10 Proof of Theorem 4.8 (active next-class selection for trees)

Proof. We prove the result by showing that the two optimization problems are equivalent. Let $v^\star$ be a solution to the following optimization problem:
$$\arg\max_{y \in \mathcal{Y} \setminus \Pi(\Lambda)} d(y, b) \quad \text{s.t.} \quad b \in \partial_{\mathrm{in}} T',\ \ \Gamma(b, y) \setminus \{b\} \subseteq \mathcal{Y} \setminus \Pi(\Lambda),$$
where $\partial_{\mathrm{in}} T'$ is the inner boundary of $T'$, the subgraph of $T$ whose vertices are $\Pi(\Lambda)$. This optimization problem can be equivalently rewritten as
$$\arg\max_{y \in \mathcal{Y} \setminus \Pi(\Lambda)} |\Gamma(y, b)| \quad \text{s.t.} \quad b \in \partial_{\mathrm{in}} T',\ \ \Gamma(b, y) \setminus \{b\} \subseteq \mathcal{Y} \setminus \Pi(\Lambda),$$
and we can furthermore introduce additional terms that do not change the maximizer:
$$\arg\max_{y \in \mathcal{Y} \setminus \Pi(\Lambda)} \Big|\bigcup_{\lambda_i, \lambda_j \in \Lambda} \Gamma(\lambda_i, \lambda_j) \cup \Gamma(b, y)\Big| \quad \text{s.t.} \quad b \in \partial_{\mathrm{in}} T',\ \ \Gamma(b, y) \setminus \{b\} \subseteq \mathcal{Y} \setminus \Pi(\Lambda).$$
Equivalently, we can also connect $b$ to one of the elements of $\Lambda$:
$$\arg\max_{y \in \mathcal{Y} \setminus \Pi(\Lambda)} \Big|\bigcup_{\lambda_i, \lambda_j \in \Lambda} \Gamma(\lambda_i, \lambda_j) \cup \Gamma(\lambda_i, b) \cup \Gamma(b, y)\Big| \quad \text{s.t.} \quad b \in \partial_{\mathrm{in}} T',\ \ \Gamma(b, y) \setminus \{b\} \subseteq \mathcal{Y} \setminus \Pi(\Lambda).$$
Due to the uniqueness of paths in trees, this optimization problem also has the following equivalent form without any dependence on $b$:
$$v^\star \in \arg\max_{y \in \mathcal{Y} \setminus \Lambda} \Big|\bigcup_{\lambda_i, \lambda_j \in \Lambda} \Gamma(\lambda_i, \lambda_j) \cup \Gamma(\lambda_i, y)\Big| = \arg\max_{y \in \mathcal{Y} \setminus \Lambda} \Big|\bigcup_{\lambda_i, \lambda_j \in \Lambda} \Pi(\{\lambda_i, \lambda_j\}) \cup \Pi(\{\lambda_i, y\})\Big| = \arg\max_{y \in \mathcal{Y} \setminus \Lambda} \Big|\bigcup_{\lambda_i, \lambda_j \in \Lambda \cup \{y\}} \Pi(\{\lambda_i, \lambda_j\})\Big| = \arg\max_{y \in \mathcal{Y} \setminus \Lambda} |\Pi(\Lambda \cup \{y\})|,$$
using Lemma A.5. Therefore $v^\star$ is a maximizer of $|\Pi(\Lambda \cup \{v\})|$, as required.

B Algorithms and Time Complexity Analyses

We provide time complexity analyses for Algorithms B.1, B.2, and B.3.

B.1 Analysis of Algorithm B.1 (locus cover for phylogenetic trees)

We provide Algorithm B.1 with comments corresponding to the time complexity of each step.

Algorithm 1 Locus cover for phylogenetic trees
Require: phylogenetic tree T = (V, E), Y = Leaves(T), N = |Y|
  P ← sortbylength([Γ(y_i, y_j)]_{i,j ∈ [N]})    ▷ O(N|E| + N² log N)
  P ← reverse(P)                                 ▷ O(N²)
  Λ ← ∅
  for Γ(y_i, y_j) in P do                        ▷ O(N²) iterations
    if Π(Λ) = Y then                             ▷ O(K²D max{N|E|, N² log N})
      return Λ
    else
      Λ ← Λ ∪ {y_i, y_j}
    end if
  end for

Combining these, we obtain the following time complexity:
$$O\big(N|E| + N^2 \log N + N^2 + N^2 \cdot K^2 D \max\{N|E|, N^2 \log N\}\big) = O\big(N^2 K^2 D \max\{N|E|, N^2 \log N\}\big).$$

B.2 Analysis of Algorithm B.2 (computing a pairwise decomposable locus)

We first provide Algorithm B.2 here, with comments corresponding to the time complexity of each step.

Algorithm 2 Computing a pairwise decomposable locus
Require: Λ, Y, G = (V, E)
  Π ← ∅
  D ← diam(G)                                    ▷ O(N|E| + N² log N)
  for λ_i, λ_j in Λ do                           ▷ O(K²) iterations
    for w_1 in {0, 1/D, …, 1} do                 ▷ O(D) iterations
      w ← [w_1, 1 − w_1]
      Π ← Π ∪ m_{λ_i, λ_j}(w)                    ▷ O(N|E| + N² log N)
    end for
  end for
  return Π

We first compute the diameter of the graph G = (Y, E) with N = |Y|, which is done in O(N|E| + N² log N) time using Dijkstra's algorithm to compute the shortest paths between all pairs of vertices. We then iterate over all pairs of elements in Λ with K = |Λ|, which amounts to O(K²) iterations. Within this, we perform O(D) computations of the Fréchet mean, each of which requires O(N|E| + N² log N) arithmetic operations or comparisons. Combining these, the final time complexity is
$$O\big(N|E| + N^2 \log N + K^2 D (N|E| + N^2 \log N)\big) = O\big(K^2 D \max\{N|E|, N^2 \log N\}\big).$$

B.3 Analysis of Algorithm B.3 (computing a generic locus)

We provide Algorithm B.3 with comments corresponding to the time complexity of each step.

Algorithm 3 Computing a generic locus
Require: Λ, Y, G = (V, E)
  Π ← ∅
  D ← diam(G)                                    ▷ O(N|E| + N² log N)
  for w in {0, 1/D, …, 1}^{|Λ|} do               ▷ O(D^K) iterations
    Π ← Π ∪ m_Λ(w)                               ▷ O(N|E| + N² log N)
  end for
  return Π

Following a similar argument to the analysis of Algorithm B.2, the time complexity is
$$O\big(N|E| + N^2 \log N + D^K (N|E| + N^2 \log N)\big) = O\big(D^K \max\{N|E|, N^2 \log N\}\big),$$
where K is the number of observed classes.

Figure 4: CLIP on CIFAR-100 with the default tree and the WordNet hierarchy. (Left) Reliability diagrams across a range of Softmax temperatures, highlighting the CLIP default temperature, the optimal temperature for LOKI, and the minimizer of the Expected Calibration Error (ECE). All three are well-calibrated.
(Center) Tradeoff between optimizing for ECE and the expected squared distance. As with the reliability diagrams, the CLIP default temperature, the LOKI-optimal temperature, and the ECE-optimal temperature are similar. (Right) Tradeoff between optimizing for zero-one error and the expected squared distance. Depending on the metric space, temperature scaling can improve accuracy.

C Additional Experiments

C.1 Additional calibration experiments

In the main text, we evaluated the effect of calibrating the Softmax outputs on the performance of LOKI using a metric space based on the internal features of the CLIP model. Here, we perform the same analysis using two external metric spaces.

Setup: We perform our calibration analysis using the default tree and the WordNet tree.

Results: Our results are shown in Figure 4. As in our experiment using an internally-derived metric, we find that the optimal Softmax temperature is close to both the CLIP default and the optimal temperature for calibration. For the default tree, we again find that temperature scaling can be used to improve accuracy with LOKI.

Setup: We calibrate via Softmax temperature scaling [13] using CLIP on CIFAR-100. We do not use an external metric space, and instead use the Euclidean distance applied to the CLIP text encoder.

Results: The reliability diagram in Figure 3 shows that the optimal Softmax temperature for LOKI is close both to the default temperature used by CLIP and to the optimally-calibrated temperature. In Figure 3 (right), we find that appropriate tuning of the temperature parameter can lead to improved accuracy with CLIP, even when no external metric space is available.

C.2 Ablation of LOKI formulation

LOKI is based on the Fréchet mean, which is defined as $\arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{K} P_{\lambda_i \mid x}\, d^2(y, \lambda_i)$. However, this is not the only approach that can be considered. For example, the Fréchet median, often used in robust statistics, is defined as $\arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{K} P_{\lambda_i \mid x}\, d(y, \lambda_i)$, without squaring the distances. More generally, we define $\arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{K} P_{\lambda_i \mid x}\, d^\beta(y, \lambda_i)$ and evaluate different choices of $\beta$.

Setup: We conduct this experiment on ImageNet using SimCLR as our pretrained classifier with K = 250, 500, and 750 randomly selected classes.

Results: From this analysis, we conclude that using the Fréchet mean is the optimal formulation for LOKI, as it achieves the lowest mean squared distance for all settings of K.

Table 4: Expected squared distances of SimCLR+LOKI alternatives on ImageNet. We ablate over the choice of distance exponent $\beta$ in $\hat{y} \in \arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{K} P_{\lambda_i \mid x}\, d^\beta(y, \lambda_i)$, including the Fréchet median ($\beta = 1$), and find that the optimal setting is $\beta = 2$, corresponding to LOKI (the Fréchet mean).

β           K = 250    K = 500    K = 750
0.5         61.28      46.57      36.31
1           52.98      41.06      33.14
2 (LOKI)    46.78      37.76      29.88
4           47.99      39.58      34.65
8           65.29      58.42      54.90

C.3 Comparison across metric spaces

Expected squared distances cannot be directly compared across metric spaces, as they may be on different scales. Our solution is to use a form of normalization: we divide the expected squared distance by the square of the graph diameter. This brings all of the values into the [0, 1] range, and since $\mathbb{E}[d^2(y, \hat{y})]/\mathrm{diam}(G)^2$ also generalizes the 0-1 error, this enables comparison between 0-1 errors and errors from different metric spaces. We provide these results in Table 5, again for our CLIP experiments on CIFAR-100; a brief code sketch of this normalization follows.
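The sketch below illustrates the normalized evaluation metric described above. It is our own minimal example (not part of the released implementation), assuming labels and predictions are given as nodes of a networkx graph; the helper name normalized_expected_sq_distance is ours.

```python
import networkx as nx

def normalized_expected_sq_distance(G, y_true, y_pred):
    """Compute E[d^2(y, y_hat)] / diam(G)^2 using shortest-path distances."""
    dist = dict(nx.all_pairs_shortest_path_length(G))
    diam = max(max(row.values()) for row in dist.values())
    sq_dists = [dist[y][yh] ** 2 for y, yh in zip(y_true, y_pred)]
    return sum(sq_dists) / len(sq_dists) / diam ** 2

# On a complete graph, d(y, y_hat)^2 is 0 or 1, so the metric reduces to the 0-1 error.
G = nx.complete_graph(5)
print(normalized_expected_sq_distance(G, [0, 1, 2, 3], [0, 2, 2, 4]))  # 0.5
```

This is consistent with the complete-graph row of Table 5, where the normalized quantity is exactly the usual 0-1 error.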
This evaluation metric enables us to determine which metric spaces have geometry best suited to our pretrained models.

Setup: Using $\mathbb{E}[d^2(y, \hat{y})]/\mathrm{diam}(G)^2$, we re-evaluate our CLIP results on CIFAR-100 shown in Table 1 for the ResNet-50 architecture.

Results: For CIFAR-100, we observe that the WordNet metric space results in the lowest error and therefore has the best-fitting geometry.

Table 5: Comparison across metric spaces for CLIP on CIFAR-100, normalizing by the squared diameter of the metric space: $\mathbb{E}[d^2(y, \hat{y})]/\mathrm{diam}(G)^2$.

Metric space G     diam(G)²    E[d²(y, ŷ)]/diam(G)²
Complete graph     1           0.5941 (0-1 error)
Default tree       16          0.4493
WordNet tree       169         0.1157
CLIP features      0.9726      0.2686

D Experimental Details

In this section, we provide additional details about our experimental setup. Code implementing all of our experiments is available at https://github.com/SprocketLab/loki.

D.1 CLIP experiments

All experiments are carried out using frozen CLIP weights. No training is involved and there are no hyperparameters in the CLIP experiments, except for the Softmax temperature in the calibration analysis. We evaluate using the default CIFAR-100 test set. The label prompt we use is "a photo of a [class_name]."

D.2 ImageNet experiments

To construct our datasets, we randomly sample 50 images per class from ImageNet as our training dataset and use the ImageNet validation set to evaluate LOKI's performance. We extract image embeddings using SimCLRv1 (checkpoints from https://github.com/tonylins/simclr-converter) and train a baseline one-vs-rest classifier. We use the WordNet phylogenetic tree as the metric space; its structure is available at https://github.com/cvjena/semantic-embeddings/tree/master/ILSVRC. We test different numbers of observed classes, K, out of the 1,000 classes. Observed classes are sampled in two ways: uniformly, and from a Gibbs distribution with concentration parameter 0.5.

D.3 LSHTC experiments

We generate a summary graph of the LSHTC class graph (resulting in supernodes representing many classes) by iteratively (1) randomly selecting a node or supernode and (2) merging its neighbors into the node to create a supernode, until the graph contains at most 10,000 supernodes. In the LSHTC dataset, each datapoint is assigned to multiple classes; we push each class to its supernode and then apply a majority vote to determine the final class. We test different numbers of observed classes, K, out of the 10,000 supernodes. We collect the datapoints that belong to the observed classes, use half of them as the training dataset, and use the remainder as the test dataset, together with the datapoints that are not in the observed classes. We train a baseline classifier using a 5-NN model and compare its performance with LOKI.

E Broader Impacts and Limitations

As a simple adaptor of existing classifiers, it is possible for our method to inherit biases and failure modes present in existing pretrained models. On the other hand, if the possibility of harmful misclassifications is known a priori, information to mitigate these harms can be baked into the metric space used by LOKI. Finally, while we have found that LOKI often works well in practice, it is possible for the per-class probabilities output by a pretrained model to be sufficiently mis-specified relative to the metric over the label space, or for the metric itself to be mis-specified. Nonetheless, we have found that off-the-shelf and self-derived metric spaces work well in practice.
F Random Locus Visualizations

In Figures 1, 5, 6, 7, 8, and 9, we provide visualizations of the classification regions of the probability simplex when using LOKI with only three observed classes out of 100 total classes, under different types of random metric spaces. The first example in Figure 1 shows the prediction regions on the probability simplex when using standard arg max prediction: the three regions correspond to predicting one of the three observed classes (0, 39, and 99), and there are no regions corresponding to any of the other classes {1, ..., 38, 40, ..., 98}. We compute these regions using a version of Algorithm B.3; while it does have exponential time complexity, the exponent is only three in this case since we consider only three observed classes (a brute-force sketch of this computation is given after the figure captions below). On the other hand, the other three examples in Figure 1, along with Figures 5, 6, 7, 8, and 9, all show the prediction regions on the probability simplex when using LOKI. Figure 5 shows this for random tree graphs. Here, the prediction regions are striped or contain a single triangle-shaped region in the center; these correspond, respectively, to intermediate classes along branches of the tree leading up from an observed class, and to the prediction region formed by the common-most parent node. Figure 6 shows similar regions, although these are more complex and thus more difficult to interpret, as phylogenetic trees are metric subspaces equipped with the induced metric from trees. Furthermore, in order to generate phylogenetic trees with 100 leaves, we needed to create much larger trees than the ones used for Figure 5, which led to narrower prediction regions due to the higher graph diameter. Finally, Figures 7, 8, and 9 each show the prediction regions using LOKI when the metric space is a random graph produced using a Watts-Strogatz, Erdős-Rényi, or Barabási-Albert model, respectively. These prediction regions are more complex and represent more complex relationships between the classes.

Figure 5: Classification regions in the probability simplex of 3-class classifiers faced with a 100-class problem where the classes are related by random tree graphs.

Figure 6: Classification regions in the probability simplex of 3-class classifiers faced with a 100-class problem where the classes are related by random phylogenetic trees.

Figure 7: Classification regions in the probability simplex of 3-class classifiers faced with a 100-class problem where the classes are related by random small-world graphs generated by a Watts-Strogatz model.

Figure 8: Classification regions in the probability simplex of 3-class classifiers faced with a 100-class problem where the classes are related by random Erdős-Rényi graphs.

Figure 9: Classification regions in the probability simplex of 3-class classifiers faced with a 100-class problem where the classes are related by random Barabási-Albert graphs.
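Below is the brute-force sketch referenced above. It is our own minimal illustration (not the released code), assuming networkx and numpy are available; it uses a random Barabási-Albert graph as the label metric space (one of the random models used for Figures 7-9) and sweeps a grid over the 3-dimensional probability simplex, recording which class the Fréchet-mean rule returns at each grid point.

```python
import networkx as nx
import numpy as np

N, observed = 100, [0, 39, 99]          # 100 classes, 3 observed classes

# Random graph metric over the label space (shortest-path distances).
G = nx.barabasi_albert_graph(N, 1, seed=0)
dist = dict(nx.all_pairs_shortest_path_length(G))
D2 = np.array([[dist[y][lam] ** 2 for lam in observed] for y in range(N)])  # shape (N, 3)

def loki_predict(p):
    """Frechet-mean prediction: argmin_y sum_i p_i * d^2(y, lambda_i)."""
    return int(np.argmin(D2 @ p))

# Sweep a grid over the probability simplex and record the predicted class.
steps = 60
regions = {}
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        p = np.array([i, j, steps - i - j]) / steps
        regions[(i, j)] = loki_predict(p)

print(f"{len(set(regions.values()))} distinct prediction regions")
```

With arg max prediction, the same sweep would only ever return the three observed classes; under the Fréchet-mean rule, interior points of the simplex typically map to unobserved classes, which is what produces the additional regions seen in the figures.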