# Hierarchical Novelty Detection via Fine-Grained Evidence Allocation

Spandan Pyakurel¹, Qi Yu¹

¹Rochester Institute of Technology, Rochester, New York. Correspondence to: Qi Yu.

By leveraging a hierarchical structure of known classes, Hierarchical Novelty Detection (HND) offers fine-grained detection results that pair detected novel samples with their closest (known) parent class in the hierarchy. Prior knowledge of the parent class provides valuable insights to better understand these novel samples. However, traditional novelty detection methods try to separate novel samples from all known classes using uncertainty or distance based metrics, so they are incapable of locating the closest known parent class. Since the novel class is also part of the hierarchy, the model can more easily get confused between samples from known classes and those from novel ones. To achieve effective HND, we propose to augment the known (leaf-level) classes with a set of novel classes, each of which is associated with one parent (i.e., non-leaf) class in the original hierarchy. Such a structure allows us to perform novel fine-grained evidence allocation to differentiate known and novel classes guided by a uniquely designed loss function. Our thorough theoretical analysis shows that fine-grained evidence allocation creates an evidence margin to more precisely separate known and novel classes. Extensive experiments conducted on real-world hierarchical datasets demonstrate that the proposed model outperforms the strongest baselines and achieves the best HND performance.

1. Introduction

Novelty detection aims to tackle challenging real-world scenarios, where test samples may come from previously unseen classes outside of the training distribution. Various novelty detection techniques have been developed with promising detection performance (Chen et al., 2021; Vaze et al., 2021;
Chen et al., 2020; Zhang et al., 2020). Uncertainty or distance based metrics are commonly leveraged to quantify how a testing sample is different from known ones. However, most existing methods only provide a binary detection result, indicating whether the sample is novel or not. Such a coarse-grained result does not offer additional insight on the nature of the novel sample to further inform decision-making. For example, when detecting a new type of malware, it may be beneficial to identify the closest software family it belongs to, which can help security engineers quickly develop a defense strategy. Similar cases can be found in many other domains: when a newly synthesized protein is discovered, locating the most similar existing protein type can equip biologists with valuable prior knowledge to study the novel one and advance scientific discovery.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Figure 1. An illustrative example of Hierarchical Novelty Detection (HND): (a) A hierarchy of traffic signs with known classes (Stop, Yield, Hiking Trail and Picnic Area); (b) Testing samples from both known and novel classes.

To perform fine-grained novelty detection, it is beneficial to leverage existing hierarchical structures that humans commonly use to organize information. For example, most real-world objects can be described using a hierarchical structure based on their relationship with other relevant objects. Many benchmark datasets also organize the training classes into a hierarchical structure. With a hierarchy of known classes in place, fine-grained novelty detection can be achieved by simultaneously performing novelty detection while accurately identifying a parent class within the hierarchy that the novel sample is most similar to.
We refer to this problem as hierarchical novelty detection (HND). As shown in Figure 1, given a set of known types of traffic signs organized by the hierarchy in (a) and used in model training, the testing samples may come from known classes (Stop) or represent new types of traffic signs, including Do Not Enter and Food as shown in (b). For those novel samples, a properly trained HND model not only needs to detect that they are not part of the existing hierarchy but also has to assign them to the closest parent class: Do Not Enter → Novel Regulatory and Food → Novel Recreation.

To achieve good novelty detection performance, a general detection model tries to separate novel data samples from known classes as much as possible. As a result, the model tends to assign a high uncertainty (or distance) score to novel data samples so that they can be clearly differentiated from known samples. However, directly applying existing novelty detection techniques does not meet the unique requirement of HND. A fundamental challenge is that the novel samples are no longer totally unbounded as in the standard novelty detection setting. Instead, they are also part of the hierarchy, and the novelty arises because their corresponding class was not included at training time. Thus, simply assigning a high uncertainty/distance score to a novel data sample may push it outside the entire hierarchy, which makes it impossible to identify a close parent class to better understand the nature of the sample.

HND is only sparsely pursued by existing efforts. One viable solution is to conduct hierarchical classification (HC) augmented with novelty detection techniques (Lee et al., 2018; Wang et al., 2022) (referred to as HC-ND). For each node visited by the HC process, a novelty score is predicted and compared with a pre-defined threshold to determine whether HC-ND should continue or stop.
For samples from a known class, HC should proceed to the bottom layer of the hierarchy and assign them to the corresponding leaf node; for a novel sample, HC-ND should identify the right non-leaf node at which to stop when sufficient novelty is detected, making it impossible to further assign the sample to one of the existing child classes. The effectiveness of HC-ND heavily hinges on the HC model, as a mistake made at any point during the hierarchical classification process will result in a wrong detection result. Consequently, the detection error accumulates quickly with the depth of the hierarchy. Furthermore, a different novelty threshold may be assigned depending on the depth of the hierarchy, which further complicates the detection process.

To avoid a fast accumulating detection error in HC-ND, one can convert a multi-level hierarchy into a flat structure (Lee et al., 2018). To allow a novel data sample to be assigned to any non-leaf node as its closest parent class, the flat structure augments all known leaf classes with a set of novel classes, each of which is associated with one non-leaf node in the original hierarchy. Figure 2 shows an example of the flat structure, where the augmented novel classes are highlighted in red.

Figure 2. Flatten the hierarchy of Figure 1(a) to convert the problem into a multi-class (leaf classes and novel non-leaf classes) classification problem. Novel non-leaf classes include Novel Traffic Sign, Novel Regulatory Sign, and Novel Recreation Sign.

One remaining challenge lies in the lack of training samples from the novel classes. To overcome this, a Leave-One-Out (LOO) training process has been developed that iteratively removes classes from the hierarchy and treats them as the novel class to support training. This process can effectively avoid assigning a novel sample to any known classes.
Nevertheless, since samples from known classes are used as novel ones during training, the model may have trouble differentiating the known classes from the novel ones during testing (as shown in Figure 5).

To achieve effective HND, we propose to conduct novel fine-grained evidence allocation for hierarchical novelty detection. By leveraging the flattened structure, we perform evidence-based multi-class classification to train a model that can allocate different amounts of evidence to known classes and novel ones, respectively, which forms a margin to separate them more precisely. In particular, for testing samples from known classes, the model is trained to assign high evidence to the corresponding leaf class, so it can be clearly differentiated from other known classes as well as the novel classes; for a novel data sample, the model assigns moderate evidence to the corresponding novel class while ensuring low evidence for all other classes. Model training is guided by a uniquely designed loss function with strong theoretical guarantees to create an evidence margin for improved HND. In addition, prior belief on the existence of certain novel classes can be incorporated in a principled way by adjusting the base rate in the evidential formulation. Our empirical results confirm that HND performance can indeed benefit from such prior belief. The contributions of this paper are threefold:

We propose a novel method, referred to as evidential hierarchical novelty detection (E-HND), that leverages fine-grained evidence to more precisely differentiate samples of known classes from those of novel ones in the same hierarchy.
We design a unique loss function that can create an evidence margin to ensure good separation of known and novel samples with sound theoretical guarantees. We leverage the base rate in the evidential formulation to incorporate prior belief on the existence of novel classes.

We perform extensive experiments on multiple real-world hierarchical datasets, which show the effectiveness of the proposed E-HND model. Comparison with the strongest known baselines shows that E-HND achieves the best HND performance to date.

Figure 3. Working mechanism of leave-one-out (LOO) training. A training sample from the Stop class is used to train HND. The green color represents the ground truth (GT), and the dashed line represents the non-ground-truth classes. (a) GT is the known leaf class. (b) By removing the known leaf class (Stop) from the hierarchy, GT becomes its novel parent class (Novel Regulatory Sign). (c) By removing the Regulatory Sign class, GT becomes the Novel Traffic Sign class.

2. Related Works

Novelty Detection. The field of novelty detection aims to identify whether a sample is from a known or a novel class. To perform novelty detection, some methods use the maximum probability or maximum logit value as the score for assigning a sample to a known class (Vaze et al., 2021). Similarly, Bendale & Boult (2016) use a separate class to assign the probability that a sample belongs to a novel class. Chen et al. (2020; 2021) and Yang et al. (2020) learn prototypes based on known classes and assign a test sample to a known class on the basis of how close it is to the prototypes. Chen et al. (2021) further utilize adversarial learning-based training to generate novel samples and improve the novelty detection performance.
Moreover, there are various uncertainty-based methods (Sensoy et al., 2018; Malinin & Gales, 2018; Charpentier et al., 2020) that quantify the uncertainty in the prediction for novel samples. These methods cannot be directly used in HND, as HND additionally requires identifying the closest parent of the novel sample. Hence, to tackle the problem, we need to consider the hierarchical structure within known classes.

Hierarchical Novelty Detection. There are various works (Chang et al., 2021; Chen et al., 2022; Zhao et al., 2021; Du et al., 2020) in the field of hierarchical classification that achieve promising results on identifying samples from fine-grained classes. However, these classifiers are not equipped with novelty detection mechanisms. To introduce novelty detection into the hierarchy, Lee et al. (2018) use a KL divergence based confidence score for each local classifier. Further, to improve novelty detection, Wang et al. (2022) use fuzzy logic as an uncertainty measure. However, using multiple classifiers in the hierarchy causes errors to accumulate while making predictions, and it also requires setting multiple thresholds. To avoid the use of multiple thresholds and error propagation, Lee et al. (2018) flatten the hierarchy to perform multi-class classification for leaf and non-leaf classes together. Ruiz & Serrat (2022) use a cosine loss to learn prototypes for leaf and non-leaf classes and assign the test sample to the closest learned prototype. These methods train on known samples relabeled as novel non-leaf classes, causing the model to confuse known and novel classes in the testing phase. To address the problem, we conduct HND based on fine-grained evidence allocation that helps separate known and novel classes.

Novel Category Discovery (NCD).
The goal of this line of research (Han et al., 2019; 2020; 2021; Zhong et al., 2021a;b) is to discover novel categories from unlabeled samples, where the unlabeled data contain samples from only novel categories. The more realistic setting is to include unlabeled samples from both known and novel categories, as explored in Generalized Category Discovery (GCD) (Vaze et al., 2022; Rastegar et al., 2024). NCD and GCD, along with HND, go beyond binary ID/OOD detection by assigning the detected novel samples to specific classes. Furthermore, an implicit binary hierarchical tree is learned to support the fine-grained categorization of the novel samples (Rastegar et al., 2024). In contrast, HND leverages an existing hierarchy of known classes to supervise the fine-grained categorization of the detected novel samples if they fall into the hierarchy. NCD and GCD methods categorize the novel samples through clustering, which could be less accurate due to the lack of detailed supervision as in HND. Conversely, for a novel sample that is outside of the existing hierarchy, where no fine-grained supervision is available, NCD/GCD methods can be applied. Therefore, HND and NCD/GCD can be used in a complementary way depending on whether the detected novel samples belong to the existing hierarchy or not.

Figure 4. A training image from the Stop class is used by the cross-entropy based loss function and E-HND: (a) logits mapped to ground-truth and non-ground-truth classes by the LOO method; (b) Dirichlet parameters mapped to ground-truth and non-ground-truth classes by E-HND.

3. Methodology

3.1. Problem Formulation

Let $H$ denote a hierarchical relationship between classes. For a class $y$, let $Pa(y)$, $Ch(y)$, $An(y)$ and $De(y)$ denote the parent, children, ancestors and descendants of $y$ in $H$, respectively.
There are three types of classes: leaf class (with no children), non-leaf class (ancestor of a leaf class), and novel class ($Ch(y) = \emptyset$ and $y \notin H$). Out of these classes, only the leaf and non-leaf classes are known during training, forming the hierarchy $H$. Let $N(y)$ denote the set of novel classes whose closest known parent class in $H$ is $y$. For a test sample belonging to a novel class in $N(y)$, the goal of HND is to predict $y$ for that sample.

As mentioned in the introduction, to avoid accumulating errors over the hierarchical classification process and setting layer-specific novelty thresholds, we leverage a flattened structure to conduct HND. Let $Le(H)$ and $NLe(H)$ represent the sets of leaf and non-leaf classes, respectively. To cover the entire hierarchy, we associate each non-leaf class with a novel class, as shown in Figure 2. We refer to these novel classes as novel non-leaf classes. The flattened structure allows us to perform multi-class classification in one shot by including both the known leaf classes and the novel non-leaf classes. The model outputs the probabilities for the known classes, $p^{(kn)} = [p^{(kn)}_1, \ldots, p^{(kn)}_{|Le(H)|}]$, and for the novel non-leaf classes, $p^{(no)} = [p^{(no)}_1, \ldots, p^{(no)}_{|NLe(H)|}]$. Once trained, the model performs HND by

$$\hat{k} = \arg\max_{k} \left[ p^{(kn)}_1, \ldots, p^{(kn)}_{|Le(H)|},\; p^{(no)}_1, \ldots, p^{(no)}_{|NLe(H)|} \right] \quad (1)$$

where $\hat{k}$ represents the index of the class with the highest probability among the known leaf and novel non-leaf classes.

3.2. Challenges of Model Training

One key challenge in novelty detection is the lack of samples from novel classes during the training process. LOO is a technique that leverages a flattened structure for novelty detection training using samples solely from known classes (Lee et al., 2018). Let $(x, y)$ be a training sample with a leaf-level label $y$. To support known sample classification, LOO trains the model to output the highest probability value $p(y|x)$ among all known leaf classes $Le(H)$.
To support novelty detection using samples in class $y$, it recursively removes each of its ancestors $c \in An(y)$ from $H$, resulting in a new hierarchy $H \setminus c$. For each $c$, it maximizes the probability of the new ground truth $N(Pa(c))$. As an example, consider a sample from the Stop class as shown in Figure 3. The sample is first used to maximize the probability of Stop in comparison to the non-ground-truth classes shown in the dashed box in (a). When the Stop class is removed from the hierarchy, the same training sample is used to maximize the probability of Novel Regulatory Sign as shown in (b). Finally, when the Regulatory Sign class is removed from the hierarchy, the training sample is used to maximize the probability of Novel Traffic Sign as shown in (c). Overall, the following cross-entropy loss is minimized:

$$L_{CE}(\theta) = \mathbb{E}_{p(x,y)}\Big[ -\ln p\big(y \mid x; \theta_{Le(H)}\big) - \sum_{c \in An(y)} \ln p\big(N(Pa(c)) \mid x; \theta_{N(Pa(c)) \cup Le(H \setminus c)}\big) \Big] \quad (2)$$

Since the same data sample is used to maximize both the known ground truth (i.e., $y$) and the novel ground truth (i.e., $N(Pa(c))$) during training, this can lead to a conflict that causes confusion when using the model for testing. For example, given a test Stop image, the model could output a high probability for both the known ground-truth label and each of the novel ground-truth labels, as shown in Figure 4(a). This kind of training does not allow the model to separate known and novel samples in testing. The model allocates high logit values to both known and novel samples, as we observed on the CUB test dataset in Figure 5(a), compromising the novelty detection performance in practical settings.

3.3. Learning the Evidence Margin

To avoid confusing the model during testing, it is essential to use a more fine-grained loss function that can clearly separate samples from known and novel classes. Maximizing the class probability by minimizing the cross-entropy as in (2) is inadequate.
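As a rough illustration of the LOO label construction described in Section 3.2, the sketch below walks up the toy hierarchy of Figure 1(a) and lists the ground-truth labels that a single training sample receives. The `PARENT` dictionary, the `"Novel "` name prefix, and the function name `loo_targets` are illustrative assumptions rather than part of the method description. Following the Stop example of Figure 3, the sketch treats the leaf class itself as the first class to be removed.

```python
# Toy hierarchy of Figure 1(a), encoded as child -> parent.
# This encoding and all names below are hypothetical, for illustration only.
PARENT = {
    "Regulatory Sign": "Traffic Sign",
    "Recreation Sign": "Traffic Sign",
    "Stop": "Regulatory Sign",
    "Yield": "Regulatory Sign",
    "Hiking Trail": "Recreation Sign",
    "Picnic Area": "Recreation Sign",
}


def loo_targets(leaf, parent=PARENT):
    """Return the labels one training sample is trained towards under LOO.

    The first label is the known leaf class. Each later label is the novel
    class attached to the parent of the class that is removed next, i.e.
    N(Pa(c)) for the removed class c. The root is never removed because it
    has no parent.
    """
    targets = [leaf]
    removed = leaf
    while removed in parent:
        targets.append("Novel " + parent[removed])
        removed = parent[removed]
    return targets


print(loo_targets("Stop"))
# ['Stop', 'Novel Regulatory Sign', 'Novel Traffic Sign']
```

The same sample therefore appears once with a known target and once per removed class with a novel target, which is the source of the conflict discussed above.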
To this end, we propose to conduct evidential HND, which performs fine-grained evidence-based training that guides the model to allocate distinct amounts of evidence to known and novel classes, respectively, resulting in a clear separation.

Figure 5. Comparison of the distribution of (a) logits (LOO testing) and (b) evidence (E-HND testing) for known and novel test samples in the CUB dataset.

Given a $K$-way multi-class problem, evidential learning trains a model to assign a belief mass distribution $b = [b_1, b_2, \ldots, b_K]$ along with an uncertainty mass $u$ to the classes, forming a multinomial opinion $w$ given by:

$$w = (b, u, a), \quad \text{with } \sum_{k=1}^{K} b_k + u = 1 \quad (3)$$

where $a = [a_1, a_2, \ldots, a_K]$ denotes the base rate distribution representing the prior probabilities associated with each class. The probability that a sample belongs to a class $k$ is

$$P(y = k) = b_k + a_k u \quad (4)$$

Assume that the label distribution is governed by a parameter $p = [p_1, p_2, \ldots, p_K]$, i.e., $P(y = k \mid p_k) = p_k$, which allows us to obtain (4) by marginalizing $p$. Further, assume that $p$ is drawn from a Dirichlet PDF $D(p \mid \alpha)$, where $\alpha = (\alpha_1, \ldots, \alpha_K)$. The parameter $\alpha_k$ represents the effective number of observations for class $k$. Let $r_k$ represent the observed evidence; then the parameter $\alpha_k$ is given by

$$\alpha_k = r_k + a_k W \quad (5)$$

where $W$ provides the weight of the base rates.¹ Such an expression leads to an evidence-based representation of the class probability $P(y = k)$:

$$P(y = k) = \mathbb{E}[p_k] = \frac{\alpha_k}{\sum_{k'=1}^{K} \alpha_{k'}} = \frac{r_k + a_k W}{\sum_{k'=1}^{K} r_{k'} + W} \quad (6)$$

Thus, given the ground-truth labels, training an evidential learning model can be achieved by maximizing the ground-truth label probability. This is equivalent to assigning high evidence to that label. On the other hand, the amount of evidence also reflects the confidence (or uncertainty) of the prediction.
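The mapping from evidence to Dirichlet parameters and class probabilities in (5) and (6) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `opinion_from_evidence` is hypothetical, and the expressions $b_k = r_k / S$ and $u = W / S$ with $S = \sum_k r_k + W$ are the ones obtained by matching (4) against (6), assuming the base rates sum to one.

```python
def opinion_from_evidence(evidence, base_rate=None, weight=None):
    """Map per-class evidence r_k to (alpha, belief, uncertainty, probability).

    Defaults use the common setting a_k = 1/K and W = K, so alpha_k = r_k + 1.
    Assumes the base rates sum to 1, so that sum(alpha) = sum(evidence) + W.
    """
    k = len(evidence)
    base_rate = [1.0 / k] * k if base_rate is None else base_rate
    weight = float(k) if weight is None else weight

    alpha = [r + a * weight for r, a in zip(evidence, base_rate)]  # Eq. (5)
    strength = sum(alpha)                                          # S
    belief = [r / strength for r in evidence]                      # b_k = r_k / S
    uncertainty = weight / strength                                # u = W / S
    prob = [a_k / strength for a_k in alpha]                       # Eq. (6)
    return alpha, belief, uncertainty, prob


# High evidence on one class gives a confident prediction (small u);
# no evidence at all leaves the whole mass on the uncertainty term (u = 1).
for r in ([20.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]):
    alpha, belief, u, prob = opinion_from_evidence(r)
    print(r, "-> u =", round(u, 3), ", P(y=1) =", round(prob[0], 3))
```

By construction, `belief[k] + base_rate[k] * uncertainty` equals `prob[k]`, which is exactly the relation in (4).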
By comparing (6) and (4), when the model predicts a low evidence $r_k$, the corresponding belief $b_k$ is low and the uncertainty mass $u$ is high due to the summation constraint in (3). For novel samples, it is natural for the model to make a low-confidence prediction because the model has not been exposed to such samples.

¹ The default values of $a_k$ and $W$ are usually set to $1/K$ and $K$, respectively, leading to $\alpha_k = r_k + 1$ in common settings.

Taking advantage of the key properties offered by an evidence-based formulation, we propose a novel loss function that can form an evidence margin to clearly separate known and novel samples, leading to improved novelty detection performance. On one hand, since the model naturally provides low-confidence predictions for novel samples, the loss function simulates that behavior during the training phase by upper bounding the evidence allocated to the novel classes. On the other hand, for the known classes, it allows the model to predict much higher evidence, which ensures confident predictions on known samples. More formally, given the $i$-th training sample, the proposed loss function comprises two terms that work in a multi-task fashion to allocate (i) high evidence to the ground-truth known leaf class and (ii) moderate evidence to the ground-truth novel non-leaf classes:

$$L_i(\theta) = L^{(1)}_i(\theta) + L^{(2)}_i(\theta) \quad (7)$$

$$L^{(1)}_i(\theta) = KL\big( D(p_i \mid \alpha_i; \theta_{Le(H)}) \,\|\, D(p_i \mid \hat{\alpha}_i; \theta_{Le(H)}) \big)$$

$$L^{(2)}_i(\theta) = \sum_{c \in An(y)} L^{(2)}_{i,c}(\theta)$$

$$L^{(2)}_{i,c}(\theta) = KL\big( D(p_i \mid \alpha_i; \theta_{Le'(H \setminus c)}) \,\|\, D(p_i \mid \tilde{\alpha}_i; \theta_{Le'(H \setminus c)}) \big)$$

The first term $L^{(1)}_i(\theta)$ is defined using the KL divergence between a model predicted Dirichlet distribution $D(p_i \mid \alpha_i; \theta_{Le(H)})$ and a sharp baseline Dirichlet distribution $D(p_i \mid \hat{\alpha}_i; \theta_{Le(H)})$ that assigns high evidence (i.e., $\hat{\alpha}_{ij} \gg 1$) to the ground-truth label $j$ and zero evidence to other labels. In the second term $L^{(2)}_i(\theta)$, we iteratively remove the ancestor $c \in An(y)$ from $H$, resulting in a new hierarchy $H \setminus c$ each time.
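Both loss terms rely on the KL divergence between two Dirichlet distributions, which has a standard closed form. The sketch below evaluates that closed form for a baseline that puts a value $\beta$ on the ground-truth index and 1 (zero evidence under $\alpha_k = r_k + 1$) elsewhere. It is a minimal illustration under stated assumptions: the function names, the choice $K = 4$, and $\beta = 10$ are hypothetical, and because the Python standard library has no digamma function, a recurrence plus asymptotic series is used as an approximation.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:                      # shift x upwards: psi(x) = psi(x+1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def kl_dirichlet(alpha, target):
    """KL( Dir(alpha) || Dir(target) ) in closed form."""
    a0, t0 = sum(alpha), sum(target)
    log_norm = (math.lgamma(a0) - sum(math.lgamma(a) for a in alpha)
                - math.lgamma(t0) + sum(math.lgamma(t) for t in target))
    psi_a0 = digamma(a0)
    return log_norm + sum((a - t) * (digamma(a) - psi_a0)
                          for a, t in zip(alpha, target))


def baseline(num_classes, truth_index, beta):
    """Baseline parameters: beta at the ground-truth index, 1 elsewhere."""
    return [beta if k == truth_index else 1.0 for k in range(num_classes)]


# Sweep the predicted parameter of the ground-truth class around beta = 10.
target = baseline(4, 0, 10.0)
for a in (5.0, 8.0, 10.0, 12.0, 20.0):
    print(a, round(kl_dirichlet(baseline(4, 0, a), target), 4))
```

In this toy sweep the divergence is zero when the predicted parameter equals $\beta$ and grows on either side of it, so minimizing the term pulls the ground-truth parameter towards $\beta$ rather than towards arbitrarily large values.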
For each $c$, $L^{(2)}_{i,c}(\theta)$ is defined by the KL divergence between a model predicted Dirichlet distribution $D(p \mid \alpha; \theta_{Le'(H \setminus c)})$ and a less sharp Dirichlet distribution $D(p \mid \tilde{\alpha}; \theta_{Le'(H \setminus c)})$, where $Le'(H \setminus c)$ denotes $N(Pa(c)) \cup Le(H \setminus c)$. The Dirichlet parameters of the two baseline distributions are given as

$$\hat{\alpha}_{ik} = \begin{cases} \beta_1 \gg 1, & \text{if } k = j_H \\ 1, & \text{otherwise} \end{cases} \quad (8)$$

$$\tilde{\alpha}_{ik} = \begin{cases} \beta_2 \text{ with } 1 < \beta_2 < \beta_1, & \text{if } k = j_{H \setminus c} \\ 1, & \text{otherwise} \end{cases} \quad (9)$$

where $j_H$ and $j_{H \setminus c}$ denote the known ground-truth label and the novel ground-truth label when $c$ is removed, respectively. Figure 4(b) illustrates the key idea of the proposed loss function, which forms an evidence margin $(\beta_1 - \beta_2)$ to clearly separate the known ground-truth and novel ground-truth labels. Learning such an evidence margin can lead to a much improved HND performance in testing. As shown in Figure 5(b), the model tends to predict much higher evidence for known samples than for novel samples. In contrast, without learning the evidence margin, the model predicts very similar logits for both known and novel samples, as shown in Figure 5(a).

3.4. Theoretical Analysis

In this section, we perform a theoretical analysis to understand the key properties of the proposed loss function. First, we analyze how the proposed loss function is guaranteed to learn an evidence margin that separates known ground-truth and novel ground-truth labels in Theorem 3.1. We then show in Theorem 3.2 that the updates from the two loss terms do not conflict with each other, even though the same data sample is leveraged in both terms in a multi-task fashion.

Theorem 3.1 (Evidence margin learning). Given a hierarchy $H$ and a training sample $i$, let the known ground-truth class be $y$ with index $j_H$, and let the novel ground-truth index be $j_{H \setminus c}$ for $c \in An(y)$. The loss function trains the model to assign evidence such that

$$1 \le \alpha_{j_H} \le \beta_1, \quad 1 \le \alpha_{j_{H \setminus c}} \le \beta_2, \quad \forall c \in An(y) \quad (10)$$

When the learning converges, the Dirichlet parameters form an evidence margin given by $(\beta_1 - \beta_2)$.

Proof.
(Proof sketch) Due to limited space, we provide the proof of the theorem in the Appendix. We first define a general form of the KL divergence-based loss function with baseline ground-truth parameter $\beta$. For $\alpha_j = \beta$ and $\alpha_{k \neq j} = 1$, we show that the loss becomes 0. We then show that the loss decreases as $\alpha_j$ increases until it reaches $\beta$, and that further increasing $\alpha_j$ beyond $\beta$ causes the loss to increase.

Theorem 3.2 (Non-conflicting update). Optimizing the overall loss function in (7), which involves simultaneously minimizing the two loss terms $L^{(1)}_i(\theta)$ and $L^{(2)}_i(\theta)$, does not lead to a conflict in the model predicted Dirichlet parameters $\alpha$.

Proof. (Proof sketch) We provide the details of the proof in the Appendix. We first identify the common parameters between the two loss terms. We then show that for the non-common parameters, each loss term updates them independently. Finally, for the common parameters, we show that there is no conflicting update.

Remarks: Theorem 3.2 ensures that the model parameters can be updated consistently when optimizing the joint objective function in (7). Besides ensuring the evidence margin, both loss terms also try to assign minimum evidence to the non-ground-truth labels. As an example, in Figure 4(b) the model learns to output a Dirichlet parameter with value 1 for all non-ground-truth labels: Yield, Hiking Trail, Picnic Area.

3.5. Incorporating the Prior Belief

The evidential theory allows us to encode a prior belief in the form of the base rate distribution $a$ as we calculate the effective number of observations. Base rates denote the prior probabilities of a data sample belonging to the classes when no evidence is observed, and $W$ quantifies the weight of the base rates. In the most common setting with no strong prior belief, the Dirichlet parameter is related to evidence as $\alpha_k = r_k + 1$, where the evidence $r_k$ represents the number of observations in support of class $k$. Recall that the 1 is the result of $a_k W$ with a weak base rate $a_k = 1/K$ and $W = K$.
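To make the role of $a_k W$ concrete, the sketch below computes the pseudo counts for the flattened classes of Figure 2 under the uniform base rate and under a non-uniform one that favors a single novel class. It is only an illustration: the helper names, the boosted value 0.25, the uniform split of the remaining mass, and the toy evidence vector are assumptions chosen for this example.

```python
# Flattened classes of Figure 2: four known leaves plus three novel classes.
CLASSES = ["Stop", "Yield", "Hiking Trail", "Picnic Area",
           "Novel Traffic Sign", "Novel Regulatory Sign", "Novel Recreation Sign"]


def boosted_base_rates(classes, boosted, rate):
    """Assign `rate` to the class `boosted`; split the rest uniformly (assumption)."""
    rest = (1.0 - rate) / (len(classes) - 1)
    return [rate if c == boosted else rest for c in classes]


def pseudo_counts(base_rate, weight=None):
    """Pseudo count a_k * W per class, with W = K by default."""
    weight = float(len(base_rate)) if weight is None else weight
    return [a * weight for a in base_rate]


def expected_probs(evidence, base_rate, weight=None):
    """Class probabilities alpha_k / sum(alpha) with alpha_k = r_k + a_k * W."""
    alpha = [r + c for r, c in zip(evidence, pseudo_counts(base_rate, weight))]
    total = sum(alpha)
    return [a / total for a in alpha]


uniform = [1.0 / len(CLASSES)] * len(CLASSES)
boosted = boosted_base_rates(CLASSES, "Novel Recreation Sign", 0.25)

# Toy sample with equal evidence for Hiking Trail and Novel Recreation Sign.
evidence = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
for name, rates in (("uniform", uniform), ("boosted", boosted)):
    probs = expected_probs(evidence, rates)
    best = max(range(len(CLASSES)), key=probs.__getitem__)
    print(name, [round(c, 3) for c in pseudo_counts(rates)], "->", CLASSES[best])
```

With the uniform base rate the two classes are tied (the `max` call simply returns the first of them), whereas the boosted base rate raises the pseudo count of Novel Recreation Sign, lowers the others, and tips the ambiguous sample towards the novel class.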
By adjusting the base rates, we can incorporate a more appropriate prior belief that can further improve the HND performance in practice. In particular, a higher base rate for the known classes denotes the belief that the hierarchy is complete, so a test sample will more likely be assigned to one of the known leaf classes. In contrast, a higher base rate for the novel classes allows us to encode the belief that the current hierarchy is still incomplete. By leveraging this Bayesian formulation, we can assign different base rates given distinct prior beliefs on how incomplete each sub-hierarchy is. Applying a higher base rate to class $k$ has the effect of enforcing a stronger prior belief by increasing the pseudo counts. As an example, if we have a stronger belief that only the sub-hierarchy of Recreation Sign is likely to be incomplete in the hierarchy shown in Figure 2, we can modify the base rate distribution by using a higher base rate $a_k > 1/K$ for the corresponding Novel Recreation Sign class. As a result, the pseudo count increases for Novel Recreation Sign and decreases for the other classes. In this way, a prior belief can be encoded through base rates, resulting in an increased Dirichlet parameter that affects the loss function accordingly.

4. Experiments

We conduct experiments on real-world hierarchical datasets to assess the effectiveness of the proposed method. We investigate the effects of using different sets of hyperparameters to confirm the positive impact of learning an evidence margin. Finally, we explore the trade-off between known and novel performance by adjusting the values of the base rate distribution. Additional experiments are presented in Appendix C.

4.1.
Datasets

To evaluate the performance of E-HND and other competitive baselines, we use four real-world hierarchical datasets:

Tiny ImageNet (Le & Yang, 2015): It contains 200 classes, each with 500 training, 50 validation, and 50 test images, resulting in a total of 120k images. We randomly select 50 classes as novel and the remaining classes as known classes. For the known classes, we create the hierarchy using the hypernym-hyponym relationships from WordNet. The resulting hierarchy contains 150 leaf nodes, 86 non-leaf nodes, and 12 levels.

CUB-200-2011 (Welinder et al., 2010): It contains 12k images of fine-grained species of birds with a total of 200 classes. We use a 150-50 split of known-novel classes. We construct the hierarchy using hypernym-hyponym relationships from WordNet. The hierarchy consists of 43 non-leaf nodes, 150 leaf nodes, and 7 levels.

Table 1. Comparison Results

| Method | CUB NA@50 | CUB AUC | Tiny ImageNet NA@50 | Tiny ImageNet AUC | AWA2 NA@50 | AWA2 AUC | Traffic NA@50 | Traffic AUC |
|---|---|---|---|---|---|---|---|---|
| DARTS | 40.42 | 30.07 | 15.91 | 12.18 | 36.75 | 35.14 | 34.00 | 30.36 |
| Relabel | 38.23 | 28.75 | 18.67 | 14.73 | 45.71 | 40.28 | 39.67 | 34.03 |
| Evidential | 35.06 | 25.86 | 19.35 | 14.53 | 44.82 | 36.44 | 37.32 | 32.57 |
| HCL | 32.19 | 25.22 | 13.45 | 10.19 | 36.40 | 32.80 | 34.17 | 33.70 |
| LOO | 42.25 | 32.81 | 18.93 | 14.50 | 47.82 | 41.95 | 41.51 | 35.47 |
| E-HND | 46.18 | 35.31 | 21.44 | 16.03 | 48.22 | 42.37 | 45.09 | 41.02 |
| TD+LOO | 44.42 | 34.31 | 19.37 | 14.87 | 50.25 | 42.86 | 42.41 | 38.22 |
| TD+E-HND | 46.85 | 35.78 | 21.77 | 16.39 | 52.53 | 45.56 | 47.69 | 43.11 |

Results are obtained from local reproduction.

Animals With Attributes 2 (Lampert et al., 2014): It contains 37k images of animals with a total of 50 classes. We use a 40-10 split of known-novel classes. We construct the hierarchy using hypernym-hyponym relationships from WordNet, resulting in 21 non-leaf nodes, 40 leaf nodes, and 7 levels.

Mapillary Traffic Sign Dataset (Ertler et al., 2019): It consists of 70k images of traffic signs with a total of 203 classes.
We use the 164-39 split of known-novel classes from Ruiz & Serrat (2022) and construct the hierarchy using the parent-child relationships from the same work, resulting in 41 non-leaf nodes, 164 leaf nodes, and 4 levels.

4.2. Compared Baselines

We compare E-HND with the following state-of-the-art baselines in HND. It is worth noting that performing Hierarchical Classification augmented with Novelty Detection (HC-ND) requires setting multiple thresholds, as discussed in the introduction. Hence, it is difficult to make a fair comparison, so we do not include it as one of the competing baselines in Table 1. Instead, we use the features from one of the HC-ND methods as input to our method (TD+E-HND) and to the best-performing baseline (TD+LOO). We discuss results from HC-ND methods in Appendix C.

Dual Accuracy Reward Trade-off Search (DARTS) (Deng et al., 2012): Following the modified version of DARTS (Lee et al., 2018), we obtain the expected rewards for all the classes.

Relabel (Lee et al., 2018): For training the novel classes, the training samples from known leaf classes are randomly relabeled as novel non-leaf classes.

Leave-One-Out (LOO) (Lee et al., 2018): The method removes one class from the hierarchy at a time, changing its ground truth to the novel parent.

Top-Down Features + Leave-One-Out (TD+LOO) (Lee et al., 2018): TD+LOO uses features extracted from the TD method as input to the LOO method. In contrast, the LOO method directly uses ResNet-101 features as input.

Evidential (Sensoy et al., 2018): There are different ways of training an evidential model. We construct a baseline using the evidential log loss.

Hierarchical Cosine Loss (HCL) (Ruiz & Serrat, 2022): As explained earlier, HCL uses a cosine loss to learn prototypes of all the classes in the flattened structure in HND. We follow the code provided by the paper.

4.3. Evaluation Metrics

The test dataset contains samples from known and novel classes.
Known Accuracy (K-ACC) denotes the fraction of known test samples whose leaf-level label is predicted correctly. Similarly, Novel Accuracy (N-ACC) denotes the fraction of novel test samples whose closest parent is predicted correctly. In a practical HND testing scenario, methods should be compared using both K-ACC and N-ACC. However, the two accuracies trade off against each other: an increase in one causes the other to decrease. To capture this trade-off, we add a bias term to the logits of the novel classes; a larger bias increases N-ACC and decreases K-ACC. Using different biases yields different pairs of K-ACC and N-ACC, which we plot to obtain a K-ACC vs N-ACC curve. This allows us to compute the Area Under the Curve (AUC) of the plot to fairly compare all the methods. We also report the N-ACC at the point where the model has exactly 50% K-ACC, denoted by NA@50.

4.4. Results and Discussion

We provide the results of E-HND and the baselines on all datasets in Table 1. E-HND outperforms the baselines that use the same input features (DARTS, Relabel, Evidential, HCL, and LOO) on all the datasets in both NA@50 and AUC, which supports the effectiveness of E-HND. Further, we utilize the hierarchical features (TD features) as input to our method and the best-performing baseline, leading to TD+E-HND and TD+LOO, respectively. Both methods improve with TD features, and our method maintains its advantage.
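The evaluation protocol of Section 4.3 can be illustrated with a minimal sketch. It assumes a score matrix with one column per class (known leaf classes and novel classes together) and a boolean mask marking the novel columns; the function names `kacc_nacc_curve` and `auc_and_na_at_50` are hypothetical and not taken from the paper's code. The AUC is computed with the trapezoidal rule over the swept points only, and NA@50 is obtained by linear interpolation, which is one reasonable reading of "exactly 50% K-ACC".

```python
import numpy as np

def kacc_nacc_curve(scores, labels, is_novel_class, biases):
    """Sweep a bias added to novel-class scores and collect (K-ACC, N-ACC) pairs.

    scores: (N, C) class scores; labels: (N,) ground-truth column index
    (a leaf class for known samples, a novel class for novel samples);
    is_novel_class: (C,) boolean mask of novel-class columns.
    Assumes the test set contains both known and novel samples.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    is_novel_class = np.asarray(is_novel_class, dtype=bool)
    sample_is_novel = is_novel_class[labels]
    k_acc, n_acc = [], []
    for b in biases:
        # The bias is added only to the novel-class columns.
        pred = np.argmax(scores + b * is_novel_class, axis=1)
        correct = pred == labels
        k_acc.append(correct[~sample_is_novel].mean())
        n_acc.append(correct[sample_is_novel].mean())
    return np.array(k_acc), np.array(n_acc)

def auc_and_na_at_50(k_acc, n_acc):
    """Trapezoidal area under the K-ACC vs N-ACC curve and N-ACC at K-ACC = 0.5."""
    order = np.argsort(k_acc, kind="stable")
    k, n = k_acc[order], n_acc[order]
    auc = float(np.sum(np.diff(k) * (n[1:] + n[:-1]) / 2.0))
    na_at_50 = float(np.interp(0.5, k, n))
    return auc, na_at_50
```

The bias grid should be wide enough that K-ACC spans the range of interest; otherwise the area covers only part of the curve.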
We present the K-ACC vs N-ACC plots for the datasets in Figure 6 for E-HND, along with the baselines.

Figure 6. Known accuracy vs Novel accuracy curve for E-HND along with the baselines for 4 hierarchical datasets. Axes: Known Class Accuracy and Novel Class Accuracy. Curves: E-HND, TD+E-HND, LOO, TD+LOO, Evidential. Labeled panels include (b) Tiny Imagenet and (d) Traffic.

The set of biases used to obtain the plots represents the additional logits added to novel classes in the settings of TD+LOO, Relabel, and LOO. For the evidential models (Evidential, E-HND, and TD+E-HND), the additional bias represents pseudo counts. We can alternatively obtain the K-ACC vs N-ACC curve for our setting by adjusting the base rate distribution. We study the impact of different base rates for novel classes in Section 4.5. From the K-ACC vs N-ACC plots for the hierarchical datasets, we observe that TD+E-HND outperforms the other methods. Because our training mechanism creates an evidence margin, we obtain higher novel accuracy than the baselines across different levels of known accuracy.

4.5. Ablation Studies

Impact of β1 and β2 on performance. E-HND has two hyperparameters, β1 and β2. We recommend setting β1 > β2 ≥ 1. To study the impact of different values of β1 and β2, we plot the AUC values on test samples of the CUB dataset for different settings of β1 and β2. In Figure 7(a), we keep β2 fixed and vary β1 to obtain performance results on the test set. The curve shows that, for β1 < β2, the performance is in the lower region.
As β1 increases, the performance increases and then remains almost constant over a range of values. When β1 reaches a much higher range, the performance starts decreasing. An overly large margin lowers performance because setting β1 ≫ β2 makes $L^{(1)}_i(\theta)$ much larger than $L^{(2)}_i(\theta)$, so the overall loss focuses mostly on training the known classes. Similarly, in Figure 7(b), we keep β1 fixed and vary β2 to obtain different performance results on the test set. We see that, in the lower region of β2, a suitable margin is created, which results in higher performance. However, as β2 increases and β2 > β1, the performance becomes much lower. Both plots confirm that a reasonable margin of β1 > β2 has a positive impact on the performance. Therefore, these hyperparameters are easy to set as long as we maintain a reasonable margin of β1 > β2. However, making β1 extremely high should be avoided, as we see decreasing performance for β1 ≫ β2.

Figure 7. Impact of different values of β1 and β2. (a) Vary β1 for fixed value of β2 = 10 and β2 = 20 for CUB. (b) Vary β2 for fixed value of β1 = 45 and β1 = 60. The horizontal axis of both panels ranges from 0 to 500.

Impact of base rate distribution. In this section, we study the impact of using different base rates for novel classes. Let $a^{(kn)}$ and $a^{(no)}$ denote the base rates for known and novel classes, respectively, representing completeness and incompleteness of the hierarchy, such that

$$\sum_{k=1}^{|Le(H)|} a^{(kn)}_k + \sum_{k=1}^{|NLe(H)|} a^{(no)}_k = 1 \tag{11}$$

Now, varying $\sum_{k=1}^{|NLe(H)|} a^{(no)}_k$ in the range [0, 1], we obtain the base rate of each novel class by distributing the novel base rate over all the novel classes. We obtain the corresponding base rates for the known classes using (11). For each set of known and novel base rates, we then obtain the corresponding K-ACC and N-ACC.
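The base-rate sweep described above can be sketched as follows. This is only an illustration under stated assumptions: the novel and known masses are spread uniformly within each group (the text does not specify the distribution), classes are ordered with known classes first, and the Dirichlet parameters use the standard subjective-logic form alpha_k = r_k + a_k W, which is assumed here to correspond to the paper's equation 5. The helper names and the value of the prior weight are hypothetical.

```python
import numpy as np

def split_base_rates(n_known, n_novel, novel_mass):
    """Return base rates (known classes first, then novel) that sum to 1.

    novel_mass is the total base rate assigned to novel classes, as in (11).
    Uniform allocation within each group is an assumption of this sketch.
    """
    if not 0.0 <= novel_mass <= 1.0:
        raise ValueError("novel_mass must lie in [0, 1]")
    a_known = np.full(n_known, (1.0 - novel_mass) / n_known)
    a_novel = np.full(n_novel, novel_mass / n_novel)
    return np.concatenate([a_known, a_novel])

def dirichlet_params(evidence, base_rates, prior_weight):
    """alpha_k = r_k + a_k * W: base rates act as pseudo counts (assumed form)."""
    return np.asarray(evidence, dtype=float) + prior_weight * base_rates

# Example: with fixed evidence, a larger novel base rate shifts the
# prediction from a known class (index 0) to the novel class (index 2).
evidence = [2.0, 0.0, 1.5]
low = dirichlet_params(evidence, split_base_rates(2, 1, 0.1), prior_weight=3.0)
high = dirichlet_params(evidence, split_base_rates(2, 1, 0.9), prior_weight=3.0)
print(int(np.argmax(low)), int(np.argmax(high)))
```

Sweeping `novel_mass` over a grid in [0, 1] and recomputing the predictions at each value gives the K-ACC and N-ACC pairs discussed in this section.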
As we increase the novel base rate, we observe in Figure 8(a) for the CUB dataset that K-ACC starts decreasing and N-ACC starts increasing. This improvement is caused by an increase in pseudo counts, which increases the effective number of observations for novel classes. However, increasing the novel base rate cannot improve the N-ACC beyond 57%, indicating the challenge of correctly identifying the closest parent of a novel test sample when the model is trained with known samples only. Increasing the novel base rate eventually drives the K-ACC to 0%, which reflects the compromise that comes with using the strong prior that the hierarchy is incomplete. The trend for other datasets is similar to CUB. As we see in Figure 8(b), (c), and (d), as the novel base rate increases, the N-ACC starts increasing due to the increase in pseudo counts for novel classes.

Figure 8. Impact of using base rate distribution for (a) CUB, (b) Tiny Imagenet, (c) AWA2, (d) Traffic dataset. Horizontal axis: Novel Base Rate (0.0 to 1.0). Curves: Known Accuracy and Novel Accuracy.

We observe that the highest N-ACCs obtained by adjusting the base rate are 22%, 58%, and 72% for Tiny Imagenet, AWA2, and Traffic, respectively. The upper bound of N-ACC is lower for Tiny Imagenet because its hierarchy is deeper than those of CUB, AWA2, and Traffic, making the task of identifying the closest parent for novel samples much more difficult. The plots demonstrate the trade-off between K-ACC and N-ACC that can be obtained by adjusting the base rates for known and novel classes.

4.6.
Qualitative Study

We show the predictions of TD+E-HND and some of the baselines, visualized with the corresponding hierarchy, for representative samples from the CUB dataset in Figure 9. For the novel sample in (a), the true label and its closest parent are coded in red and green, respectively, in the hierarchy. Since the true novel class is not present in the hierarchy, the ground truth label is its parent. For the known sample in (b), the true label (also the ground truth) is coded in green. For the representative sample of Yellow Bellied Flycatcher in Figure 9(a), we observe that our method is able to identify the closest parent, Old World Flycatcher. However, other methods confuse the sample with other classes. In particular, the TD+LOO method assigns it to Acadian Flycatcher, a child class of the ground truth. We note that Acadian Flycatcher is a known leaf class in the hierarchy. The training method of TD+LOO leverages the samples from Acadian Flycatcher as samples of the novel Old World Flycatcher class. As mentioned in the challenges of LOO training, this can create confusion because the model assigns high logits to both novel and known classes. As a result, TD+LOO is not able to distinguish whether the sample belongs to Acadian Flycatcher or Novel Old World Flycatcher. We present the distribution of logits allocated by TD+LOO for the novel samples of Yellow Bellied Flycatcher to the classes Old World Flycatcher and Acadian Flycatcher in Figure 14(a) in the Appendix. We observe that both classes are assigned logits in the high region, making them indistinguishable. In contrast, for TD+E-HND in Figure 14(b), we see that Acadian Flycatcher receives lower evidence than Old World Flycatcher.
Therefore, for the particular sample in Figure 9(a), our method is able to distinguish the closest parent class.

Figure 9. Qualitative study for representative test samples: (a) Prediction for novel sample from Yellow Bellied Flycatcher (b) Prediction for known sample from Acadian Flycatcher. Labels visible in the panels: Passeriform, Oscine Bird, Old World Flycatcher, Yellow Bellied, Acadian Flycatcher; method labels "LOO, Evidential" in (a) and "LOO, TD+LOO" in (b).

Similarly, in Figure 9(b), we have the prediction for the known sample from the Acadian Flycatcher class. LOO and TD+LOO predict the representative sample as Old World Flycatcher. This is due to confusion between the logits of Old World Flycatcher and Acadian Flycatcher. We present a qualitative study of representative samples from the Traffic dataset in Appendix C.4.

5. Conclusion

In this paper, we formulate a novel evidential framework to address the unique challenges associated with hierarchical novelty detection. The proposed E-HND framework leverages fine-grained evidence quantification, creating an evidence margin to distinguish between known and novel classes in the hierarchy. In order to guide the model to learn the evidence margin, we provide the design of a novel loss function with theoretical guarantees. Further, we provide a natural way to encode prior beliefs about the completeness of the hierarchy by leveraging the base rate distribution. The proposed framework shows effectiveness in our extensive experiments with real-world hierarchical datasets.

Acknowledgement

This research was supported in part by an NSF IIS award IIS-1814450. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agency.

Impact Statement

The proposed research provides a way to not only detect novel samples but also identify the closest parent class from the training data hierarchy.
The work can be potentially useful in multiple applications to understand the relationship of novel samples with existing classes. Such relationships provide additional insights to make further decisions when tackling novel samples. While the field of novelty detection in AI treats novel samples and known classes as completely different entities, our work explores the idea of novel samples being related to known classes and aims to find that relationship. As a result, this work broadens the application of novelty detection by offering more information on novel samples, which eventually allows practitioners to make suitable decisions from additional insights.

References

Bendale, A. and Boult, T. E. Towards open set deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1563-1572, 2016.

Chang, D., Pang, K., Zheng, Y., Ma, Z., Song, Y.-Z., and Guo, J. Your "flamingo" is my "bird": Fine-grained, or not. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11476-11485, 2021.

Charpentier, B., Zügner, D., and Günnemann, S. Posterior network: Uncertainty estimation without ood samples via density-based pseudo-counts. Advances in Neural Information Processing Systems, 33:1356-1367, 2020.

Chen, G., Qiao, L., Shi, Y., Peng, P., Li, J., Huang, T., Pu, S., and Tian, Y. Learning open set network with discriminative reciprocal points. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III 16, pp. 507-522. Springer, 2020.

Chen, G., Peng, P., Wang, X., and Tian, Y. Adversarial reciprocal points learning for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):8065-8081, 2021.

Chen, J., Wang, P., Liu, J., and Qian, Y. Label relation graphs enhanced hierarchical residual network for hierarchical multi-granularity classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
4858-4867, 2022.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248-255. IEEE, 2009.

Deng, J., Krause, J., Berg, A. C., and Fei-Fei, L. Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3450-3457. IEEE, 2012.

Du, R., Chang, D., Bhunia, A. K., Xie, J., Ma, Z., Song, Y.-Z., and Guo, J. Fine-grained visual classification via progressive multi-granularity training of jigsaw patches. In European Conference on Computer Vision, pp. 153-168. Springer, 2020.

Ertler, C., Mislej, J., Ollmann, T., Porzi, L., and Kuang, Y. Traffic sign detection and classification around the world. ArXiv, abs/1909.04422, 2019. URL https://api.semanticscholar.org/CorpusID:202542747.

Fang, Z., Li, Y., Lu, J., Dong, J., Han, B., and Liu, F. Is out-of-distribution detection learnable? Advances in Neural Information Processing Systems, 35:37199-37213, 2022.

Fort, S., Ren, J., and Lakshminarayanan, B. Exploring the limits of out-of-distribution detection. Advances in Neural Information Processing Systems, 34:7068-7081, 2021.

Han, K., Vedaldi, A., and Zisserman, A. Learning to discover novel visual categories via deep transfer clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8401-8409, 2019.

Han, K., Rebuffi, S.-A., Ehrhardt, S., Vedaldi, A., and Zisserman, A. Automatically discovering and learning new visual categories with ranking statistics. arXiv preprint arXiv:2002.05714, 2020.

Han, K., Rebuffi, S.-A., Ehrhardt, S., Vedaldi, A., and Zisserman, A. Autonovel: Automatically discovering and learning novel visual categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6767-6781, 2021.

Lampert, C. H., Nickisch, H., and Harmeling, S.
Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453-465, 2014. doi: 10.1109/TPAMI.2013.140.

Le, Y. and Yang, X. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.

Lee, K., Lee, K., Min, K., Zhang, Y., Shin, J., and Lee, H. Hierarchical novelty detection for visual object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1034-1042, 2018.

Liu, W., Wang, X., Owens, J., and Li, Y. Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464-21475, 2020.

Malinin, A. and Gales, M. Predictive uncertainty estimation via prior networks. Advances in neural information processing systems, 31, 2018.

Neal, L., Olson, M., Fern, X., Wong, W.-K., and Li, F. Open set learning with counterfactual images. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

Rastegar, S., Doughty, H., and Snoek, C. Learn to categorize or categorize to learn? self-coding for generalized category discovery. Advances in Neural Information Processing Systems, 36, 2024.

Ren, J., Fort, S., Liu, J., Roy, A. G., Padhy, S., and Lakshminarayanan, B. A simple fix to mahalanobis distance for improving near-ood detection. arXiv preprint arXiv:2106.09022, 2021.

Ruiz, I. and Serrat, J. Hierarchical novelty detection for traffic sign recognition. Sensors, 22(12):4389, 2022.

Sensoy, M., Kaplan, L., and Kandemir, M. Evidential deep learning to quantify classification uncertainty. Advances in neural information processing systems, 31, 2018.

Vaze, S., Han, K., Vedaldi, A., and Zisserman, A. Open-set recognition: A good closed-set classifier is all you need. In International Conference on Learning Representations, 2021.

Vaze, S., Han, K., Vedaldi, A., and Zisserman, A. Generalized category discovery.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7492-7501, 2022.

Wang, Y., Hu, Q., Chen, H., and Qian, Y. Uncertainty instructed multi-granularity decision for large-scale hierarchical classification. Information Sciences, 586:644-661, 2022.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. Caltech-UCSD Birds 200. 2010.

Yang, H.-M., Zhang, X.-Y., Yin, F., Yang, Q., and Liu, C.-L. Convolutional prototype network for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2358-2370, 2020.

Zhang, H., Li, A., Guo, J., and Guo, Y. Hybrid models for open set recognition. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III 16, pp. 102-117. Springer, 2020.

Zhao, Y., Yan, K., Huang, F., and Li, J. Graph-based high-order relation discovery for fine-grained recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15079-15088, 2021.

Zhong, Z., Fini, E., Roy, S., Luo, Z., Ricci, E., and Sebe, N. Neighborhood contrastive learning for novel class discovery. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10867-10875, 2021a.

Zhong, Z., Zhu, L., Luo, Z., Li, S., Yang, Y., and Sebe, N. Openmix: Reviving known knowledge for discovering novel visual categories in an open world. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9462-9470, 2021b.

Zhu, Z., Liang, D., Zhang, S., Huang, X., Li, B., and Hu, S. Traffic-sign detection and classification in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2110-2118, 2016.

Organization of the Appendix

In Section A, we provide a summary of the notations used in the paper and proofs, with the corresponding descriptions.
In Section B, we provide proofs of Theorems 3.1 and 3.2. In Section C, we provide the details of experiments and additional results. In Section D, we discuss the limitations of this work. In Section E, we provide the link to the source code.

A. Summary of Notations

Notation                      Description
$H$                           A training hierarchy of known classes
$y$                           A class from hierarchy $H$
$Pa(y)$                       A set of parents of class $y$ in $H$
$Ch(y)$                       A set of children of a class $y$ in $H$
$An(y)$                       A set of ancestors of a class $y$ in $H$
$De(y)$                       A set of descendants of a class $y$ in $H$
$N(y)$                        A set of novel classes for closest known parent $y$
$Le(H)$                       A set of leaf classes of hierarchy $H$
$NLe(H)$                      A set of non-leaf classes of hierarchy $H$
$p^{(kn)}$                    Probability distribution for known classes
$p^{(no)}$                    Probability distribution for novel classes
$H \setminus c$               Hierarchy obtained by removing class $c$
$Le(H \setminus c)$           Leaf classes for the hierarchy $H \setminus c$
$L_{CE}(\theta)$              Cross Entropy based loss function for LOO training
$b$                           Belief distribution for $K$ classes
$a$                           Base rate distribution for $K$ classes
$u$                           Uncertainty mass
$w$                           Multinomial opinion
$\alpha$                      Parameters of Dirichlet PDF
$r$                           Evidence distribution for $K$ classes
$W$                           Non-informative prior weight
$L_i(\theta)$                 E-HND loss function for sample $i$
$L^{(1)}_i(\theta)$           E-HND loss term when no class is removed
$L^{(2)}_{i,c}(\theta)$       E-HND loss term when class $c$ is removed
$j_H$                         Ground truth index for $L^{(1)}_i(\theta)$
$j_{H \setminus c}$           Ground truth index for $L^{(2)}_{i,c}(\theta)$
$\beta_1$                     Baseline Dirichlet parameter for $j_H$
$\beta_2$                     Baseline Dirichlet parameter for $j_{H \setminus c}$
$a^{(kn)}$                    Base rate distribution for known classes
$a^{(no)}$                    Base rate distribution for novel classes
$S(\alpha(H))$                A set of Dirichlet parameters of hierarchy $H$
$S(\alpha^{(1)}_i(H))$        A set of Dirichlet parameters by $L^{(1)}_i(\theta)$
$S(\alpha^{(2)}_i(H))$        A set of Dirichlet parameters by $L^{(2)}_i(\theta)$

B. Proof of Theorems

B.1.
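Proof of Theorem 1

To prove the theorem, we first define a general KL divergence-based loss with target ground truth parameter $\beta$:

$$L = KL[D(p|\alpha)\,\|\,D(p|\hat{\alpha})] = \ln\Gamma(\alpha_0) - \ln\Gamma(\hat{\alpha}_0) + \sum_{k=1}^{K}\left[\ln\Gamma(\hat{\alpha}_k) - \ln\Gamma(\alpha_k)\right] + \sum_{k=1}^{K}(\alpha_k - \hat{\alpha}_k)\left[\psi(\alpha_k) - \psi(\alpha_0)\right] \tag{12}$$

With $\hat{\alpha}_j = \beta$, $\hat{\alpha}_0 = \beta + K - 1$, and $\hat{\alpha}_k = 1$ for $k \neq j$, we have:

$$L = \ln\Gamma(\alpha_0) - \ln\Gamma(\beta + K - 1) + \ln\Gamma(\beta) - \ln\Gamma(\alpha_j) + (\alpha_j - \beta)\left[\psi(\alpha_j) - \psi(\alpha_0)\right] + \sum_{k=1, k \neq j}^{K}\left[\ln\Gamma(1) - \ln\Gamma(\alpha_k)\right] + \sum_{k=1, k \neq j}^{K}(\alpha_k - 1)\left[\psi(\alpha_k) - \psi(\alpha_0)\right] \tag{13}$$

The non-ground truth Dirichlet parameters $\alpha_k$, $k \neq j$, are fixed at 1 to make the analysis easier. The loss function becomes

$$L[\alpha_{k, k \neq j} = 1] = \ln\Gamma(\alpha_0) - \ln\Gamma(\beta + K - 1) + \ln\Gamma(\beta) - \ln\Gamma(\alpha_j) + (\alpha_j - \beta)\left[\psi(\alpha_j) - \psi(\alpha_0)\right]$$
$$= \ln\Gamma(\alpha_j + K - 1) - \ln\Gamma(\beta + K - 1) + \ln\Gamma(\beta) - \ln\Gamma(\alpha_j) + (\alpha_j - \beta)\left[\psi(\alpha_j) - \psi(\alpha_j + K - 1)\right] \tag{14}$$

For the derivative, we use the following relations, where $\psi_1$ denotes the trigamma function:

$$\psi_1(z + 1) = \psi_1(z) - \frac{1}{z^2}, \qquad \psi(z) \approx \ln(z) - \frac{1}{2z}$$

Taking the derivative of the loss function w.r.t. $\alpha_j$, we have:

$$\frac{dL[\alpha_{k, k \neq j} = 1]}{d\alpha_j} = \psi(\alpha_j + K - 1) - 0 + 0 - \psi(\alpha_j) + (\alpha_j - \beta)\left[\psi_1(\alpha_j) - \psi_1(\alpha_j + K - 1)\right] + 1 \cdot \left[\psi(\alpha_j) - \psi(\alpha_j + K - 1)\right] \tag{15}$$
$$= (\alpha_j - \beta)\left[\psi_1(\alpha_j) - \psi_1(\alpha_j + K - 1)\right] \tag{16}$$
$$= (\alpha_j - \beta)\left[\psi_1(\alpha_j) - \left\{\psi_1(\alpha_j) - \frac{1}{(\alpha_j + K - 2)^2} - \frac{1}{(\alpha_j + K - 3)^2} - \dots - \frac{1}{\alpha_j^2}\right\}\right] \tag{17}$$
$$= (\alpha_j - \beta)\left[\frac{1}{(\alpha_j + K - 2)^2} + \frac{1}{(\alpha_j + K - 3)^2} + \dots + \frac{1}{\alpha_j^2}\right] \tag{18}$$

Now, using Lemmas B.1, B.2, and B.3, we prove that $L$ decreases as the ground truth parameter $\alpha_j$ increases up to $\beta$, becomes 0 at $\alpha_j = \beta$, and increases once $\alpha_j$ becomes greater than $\beta$.

Lemma B.1. KL divergence loss becomes zero when the ground truth Dirichlet parameter $\alpha_j$ is equal to $\beta$.

Proof. Substituting $\alpha_j = \beta$ and $\alpha_k = 1$ for $k \neq j$ into equation 13 gives

$$L[\alpha_j = \beta, \alpha_{k, k \neq j} = 1] = \ln\Gamma(\beta + K - 1) - \ln\Gamma(\beta + K - 1) + \ln\Gamma(\beta) - \ln\Gamma(\beta) + (\beta - \beta)\left[\psi(\beta) - \psi(\beta + K - 1)\right] + \sum_{k=1, k \neq j}^{K}\left[\ln\Gamma(1) - \ln\Gamma(1)\right] + \sum_{k=1, k \neq j}^{K}(1 - 1)\left[\psi(1) - \psi(\beta + K - 1)\right] = 0 \tag{19}$$

Lemma B.2. KL divergence loss decreases when there is an increase in the ground truth Dirichlet parameter $\alpha_j$ until it reaches the fixed value $\beta$.

Proof. Using equation 18, when the ground truth $\alpha_j < \beta$, $\frac{dL[\alpha_{k, k \neq j} = 1]}{d\alpha_j} < 0$. Hence, for $\alpha_j < \beta$, the loss decreases as $\alpha_j$ increases.

Lemma B.3.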
KL divergence loss increases when there is an increase in the ground truth Dirichlet parameter $\alpha_j$ once $\alpha_j$ becomes greater than the fixed value $\beta$.

Proof. Using equation 18, when the ground truth $\alpha_j > \beta$, $\frac{dL[\alpha_{k, k \neq j} = 1]}{d\alpha_j} > 0$. Hence, for $\alpha_j > \beta$, the loss increases as $\alpha_j$ increases.

B.2. Proof of Theorem 2

$S(\cdot)$ denotes the operation that converts a vector to a set. The total Dirichlet parameters from the model are given by $S(\alpha(H))$. For the purpose of analysis, we use a data sample $i$ and separate the parameters trained by $L^{(1)}_i(\theta)$ and $L^{(2)}_i(\theta)$ for the sample into two sets: $S(\alpha^{(1)}_i(H))$ and $S(\alpha^{(2)}_i(H))$, respectively. In a more fine-grained manner, let $S(\alpha^{(2)}_i(H \setminus c))$ denote the parameters trained by $L^{(2)}_{i,c}(\theta)$. The total Dirichlet parameters in the model can be obtained as:

$$S(\alpha_i(H)) = S(\alpha^{(1)}_i(H)) \cup \bigcup_{c \in An(y)} S(\alpha^{(2)}_i(H \setminus c)) \tag{20}$$

The common parameters between $S(\alpha^{(1)}_i(H))$ and $S(\alpha^{(2)}_i(H \setminus c))$, denoted by $S(\alpha_i(H, H \setminus c))$, are obtained by

$$S(\alpha_i(H, H \setminus c)) = S(\alpha^{(1)}_i(H)) \cap S(\alpha^{(2)}_i(H \setminus c)) \tag{21}$$
$$= S(\alpha^{(1)}_i(H)) \setminus \bigcup_{d \in Le(H),\, d \in De(c)} \{\alpha_{id}\} \tag{22}$$

The common parameters do not include the ground truth parameters from $L^{(1)}_i(\theta)$ and $L^{(2)}_{i,c}(\theta)$, given by $\alpha_{ij_H}$ and $\alpha_{ij_{H \setminus c}}$, but only the non-ground truth parameters for sample $i$ that do not belong to descendants of class $c$. Now, the difference of parameters between $S(\alpha^{(1)}_i(H))$ and $S(\alpha^{(2)}_i(H \setminus c))$ is given by:

$$S(\alpha^{(1)}_i(H)) \setminus S(\alpha^{(2)}_i(H \setminus c)) = \{\alpha_{ij_H}\} \cup \bigcup_{d \in Le(H),\, d \in De(c)} \{\alpha_{id}\} \tag{23}$$
$$S(\alpha^{(2)}_i(H \setminus c)) \setminus S(\alpha^{(1)}_i(H)) = \{\alpha_{ij_{H \setminus c}}\} \tag{24}$$

Now, the common parameters and the differences between $S(\alpha^{(1)}_i(H))$ and $\bigcup_{c \in An(y)} S(\alpha^{(2)}_i(H \setminus c))$ are obtained by

$$S\Big(\alpha_i\big(H, \bigcup_{c \in An(y)} H \setminus c\big)\Big) = S(\alpha^{(1)}_i(H)) \setminus \{\alpha_{ij_H}\} \tag{25}$$
$$S(\alpha^{(1)}_i(H)) \setminus \bigcup_{c \in An(y)} S(\alpha^{(2)}_i(H \setminus c)) = \{\alpha_{ij_H}\} \tag{26}$$
$$\Big[\bigcup_{c \in An(y)} S(\alpha^{(2)}_i(H \setminus c))\Big] \setminus S(\alpha^{(1)}_i(H)) = \bigcup_{c \in An(y)} \{\alpha_{ij_{H \setminus c}}\} \tag{27}$$

From (25), we see that the common parameters between $L^{(1)}_i(\theta)$ and $L^{(2)}_i(\theta)$ include parameters only from non-ground truth known leaf classes.
Moreover, the parameter trained only by $L^{(1)}_i(\theta)$ is the parameter of the ground truth known leaf class. Finally, the parameters trained only by $L^{(2)}_i(\theta)$ are those of the ground truth novel non-leaf classes. Now, we provide the definition of a conflicting update.

Definition B.4 (Conflicting update). A conflicting update is defined for $L^{(1)}_i(\theta)$, $L^{(2)}_i(\theta)$ and a parameter $\alpha$ when either of the following conditions is true:
Condition I: $L^{(1)}_i(\theta)$ increases $\alpha$ and $L^{(2)}_i(\theta)$ decreases $\alpha$.
Condition II: $L^{(1)}_i(\theta)$ decreases $\alpha$ and $L^{(2)}_i(\theta)$ increases $\alpha$.

Next, for the ground truth parameters, we prove that there is no conflicting update between the loss terms using Lemma B.5.

Lemma B.5. For the ground truth parameters, $\{\alpha_{ij_H}\}$ and $\bigcup_{c \in An(y)} \{\alpha_{ij_{H \setminus c}}\}$, conditions I and II from Definition B.4 do not hold.

Proof. $\frac{dL^{(2)}_i(\theta)}{d\alpha_{ij_H}} = 0$, as $L^{(2)}_i(\theta)$ is not a function of $\alpha_{ij_H}$. From Definition B.4, both conditions I and II do not hold, as $L^{(2)}_i(\theta)$ neither increases nor decreases $\alpha_{ij_H}$. For all $c \in An(y)$, $\frac{dL^{(1)}_i(\theta)}{d\alpha_{ij_{H \setminus c}}} = 0$, as $L^{(1)}_i(\theta)$ is not a function of $\alpha_{ij_{H \setminus c}}$. From Definition B.4, both conditions I and II do not hold, as $L^{(1)}_i(\theta)$ neither increases nor decreases $\alpha_{ij_{H \setminus c}}$. Hence, there are no conflicting updates for the parameters $\{\alpha_{ij_H}\}$ and $\bigcup_{c \in An(y)} \{\alpha_{ij_{H \setminus c}}\}$.

Now, to analyze the updates for non-ground truth parameters, we prove using Lemma B.6 that the KL divergence-based loss trains the model to output 1 for these parameters.

Lemma B.6. KL divergence loss increases when there is an increase in a non-ground truth Dirichlet parameter $\alpha_k$, for $k \neq j$.

For ease of analysis, fix $\alpha_j = \beta$. For $K$ Dirichlet parameters, suppose $K$ is the ground truth index, and $\alpha_1, \alpha_2, \dots, \alpha_{K-1}$ are the non-ground truth Dirichlet parameters.
L[αj = αK = β] = ln Γ(α0) ln Γ(β + K 1) + k=1,k =j ln Γ(1) ln Γ(αk) + k=1,k =j (αk 1)[ψ(αk) ψ(α0)] Taking the derivative of equation 28 w.r.t α1, we have Hierarchical Novelty Detection via Fine-Grained Evidence Allocation d L[αj = αK = β] dα1 = ψ(α0) ψ(α1) + (α1 1)[ψ1(α1) ψ1(α0)] + ψ(α1) ψ(α0) + k=2,k =j (αk 1)[ ψ1(α0)] = (α1 1)[ψ1(α1) {ψ1(α1) 1 (α0 1)2 1 (α0 2)2 ... 1 (α1)2 }] k=2,k =j (αk 1)[ ψ1(α0)] (30) = (α1 1)[ 1 (α0 1)2 + 1 (α0 2)2 + ... + 1 (α1)2 ] + k=2,k =j (αk 1)[ ψ1(α0)] (31) Here, the first term is positive, and for the second, term we use the limit definition to take the derivative of ψ(α0), we have the relation: ψ1(α0) = lim α1 0 ψ(α0 + α1) ψ(α0) ln(α0 + α1) 1 2(α0+ α1) ln(α0) + 1 2α0 α1 2 α1 α0(α0+ α1) α1 This is a 0 0 form. Hence, using L Hopital rule, taking derivative on both numerator and denominator, we have ψ1(α0) = lim α1 0 1 α0+ α1 + 1 2(α0+ α1)2 1 α0 1 2α2 0 1 ψ1(α0) = lim α1 0 α1 α0(α0 + α1) 2α0 α1 + α2 1 2α2 0(α0 + α1)2 < 0 Hence, the second term is also positive, making d L[αj=αK=β] dα1 > 0. Therefore, KL divergence loss increases when there is an increase in non-ground truth Dirichlet parameter α1. The minimum allowed value of the Dirichlet parameter is 1. Hence, the non-ground truth parameter approaches 1. This can be proved similarly for other non-ground truth Dirichlet parameters αk, k = j. Finally, for common parameters between loss terms: non-ground truth parameters, we prove that there is no conflicting update between loss terms using lemma B.7. Lemma B.7. For the common parameters, S(αi(H, S c An(c) H \ c)), conditions I and II from definition B.4 do not hold. Proof. We observe that the common parameter αi S(α(1) i (H)) \ {αij H} is a non-ground truth parameter for both L(1) i (θ) and L(2) i,c (θ), c An(y). For a non-ground truth parameter, αik, k = j H, k = j H\c, c An(y), the updates from loss terms are given by d L(1) i (θ) dαik > 0 and d L(2) i (θ) dαik > 0. 
From Definition B.4, condition I is false as $L^{(1)}_i(\theta)$ decreases $\alpha_{ik}$. From Definition B.4, condition II is false as $L^{(2)}_i(\theta)$ decreases $\alpha_{ik}$. Hence, both conditions I and II do not hold. Hence, there are no conflicting updates for the common parameters $S\big(\alpha_i(H, \bigcup_{c \in An(y)} H \setminus c)\big)$.

C. Details of Experiments and Additional Results

C.1. Significance Test

The results presented in Table 1 are from a single run of each method. Therefore, we perform a t-test to determine whether the difference in evaluation metrics between our method and the baselines is significant. We run our method (TD+E-HND) and the best-performing baseline (TD+LOO) 10 times using different random seeds. The resulting means, standard deviations, and p-values are presented in Table 2. We can see from the table that the p-values obtained for all combinations of datasets and evaluation metrics are low. Hence, the gap between the evaluation metrics of TD+E-HND and TD+LOO is significant.

Table 2. Significance Test (mean ± standard deviation over 10 runs)

Dataset        Metric  TD+LOO         TD+E-HND       p-value
CUB            NA@50   44.82 ± 0.52   46.92 ± 0.23   8.20 × 10^-10
CUB            AUC     34.50 ± 0.14   35.83 ± 0.07   7.71 × 10^-16
Tiny Imagenet  NA@50   19.16 ± 0.33   21.80 ± 0.10   3.94 × 10^-15
Tiny Imagenet  AUC     14.31 ± 0.36   16.39 ± 0.14   2.98 × 10^-13
AWA2           NA@50   50.13 ± 0.12   53.76 ± 1.2    1.94 × 10^-8
AWA2           AUC     42.23 ± 0.24   45.39 ± 0.40   3.14 × 10^-14
Traffic        NA@50   42.41 ± 0.50   47.69 ± 0.14   2.13 × 10^-17
Traffic        AUC     38.22 ± 0.39   43.11 ± 0.11   1.08 × 10^-18

C.2. Training Details

To speed up the training, we use a standard ResNet-101 architecture as a backbone to extract the features from the training samples for the CUB, AWA2, and Tiny Imagenet datasets. For the Traffic (MTSD) dataset, we use the ResNet-101 features provided by (Ruiz & Serrat, 2022). We train the model using the full batch of ResNet-101 features with the Adam optimizer and an initial learning rate of $10^{-2}$. We use the validation set to select suitable hyperparameters β1 and β2. The validation set does not include samples from the novel classes.
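As an illustration of the KL divergence-based loss analyzed in Appendix B.1, the following minimal sketch evaluates equation 12 for a target with β at the ground truth index and 1 elsewhere, using only the Python standard library. The helper names are hypothetical, the digamma routine is a generic recurrence-plus-asymptotic-series approximation rather than anything from the paper's code, and the mapping from logits to Dirichlet parameters (Softplus evidence plus 1) is an assumption that may differ from the paper's equation 5.

```python
import math

def digamma(x):
    """Digamma for x > 0 via upward recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series

def softplus(z):
    """Numerically stable log(1 + exp(z)), used to obtain non-negative evidence."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def alpha_from_logits(logits):
    """Assumed mapping: alpha_k = softplus(logit_k) + 1 (uniform pseudo count)."""
    return [softplus(z) + 1.0 for z in logits]

def target_params(num_classes, gt_index, beta):
    """Target Dirichlet: beta at the ground truth index, 1 elsewhere."""
    target = [1.0] * num_classes
    target[gt_index] = float(beta)
    return target

def kl_dirichlet(alpha, alpha_hat):
    """KL[Dir(alpha) || Dir(alpha_hat)], following equation 12."""
    a0, ah0 = sum(alpha), sum(alpha_hat)
    kl = math.lgamma(a0) - math.lgamma(ah0)
    psi_a0 = digamma(a0)
    for a, ah in zip(alpha, alpha_hat):
        kl += math.lgamma(ah) - math.lgamma(a) + (a - ah) * (digamma(a) - psi_a0)
    return kl

# With non-ground-truth parameters fixed at 1, the loss is zero at
# alpha_j = beta and grows on either side, as stated by Lemmas B.1 to B.3.
target = target_params(num_classes=5, gt_index=0, beta=20.0)
for alpha_j in (10.0, 15.0, 20.0, 25.0, 30.0):
    print(alpha_j, kl_dirichlet([alpha_j, 1.0, 1.0, 1.0, 1.0], target))
```

Calling the same function with β1 for the full hierarchy and β2 for a hierarchy with one class removed would give the two kinds of loss terms, although the exact combination used in training is defined by the paper's equation 8, which is not reproduced here.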
We use (β1, β2) values of (65, 20), (30, 5), (20, 5), and (40, 5) for CUB, AWA2, Tiny Imagenet, and Traffic, respectively. To obtain evidence from the classification network, we apply a Softplus activation to the logits. The detailed algorithm is provided in Algorithm 1. All the experiments are conducted using an NVIDIA GeForce RTX 3060 with 32GB memory. The training algorithm is implemented in PyTorch version 1.13.0 with CUDA version 11.6. The hierarchy information associated with the datasets is first computed and stored in a .npy file using NumPy (a Python library) to avoid runtime computation of hierarchy information during model training.

Algorithm 1 E-HND Training
1: Require: Hyperparameters β1, β2
2: Require: Hierarchy definition H of the known classes
3: Input: Initialized model θ
4: Input: A set of training samples with ground truth labels $\{(x_i, y_i)\}_{i=1}^{N}$
5: while not Stop Criterion do
6:   for each training sample with ground truth label $(x_i, y_i)$ do
7:     Calculate the observed evidence for all the classes $r_i = [r_{i1}, r_{i2}, \dots, r_{iK}]$ using the model: $r_i = \mathrm{softplus}(\theta(x_i))$
8:     Calculate the model-predicted Dirichlet parameters $\alpha_i$ for each class $k$ using equation 5
9:     Calculate the loss $L_i(\theta)$ using equation 8
10:   end for
11:   Calculate the total loss for all the samples $L(\theta) = \frac{1}{N}\sum_{i=1}^{N} L_i(\theta)$
12:   Update the model parameters θ by backpropagating the loss $L(\theta)$
13: end while

Table 3. Comparison with the evidential design of loss in HND (CUB)

Method                               NA@50   AUC
Log loss (Sensoy et al., 2018)       35.06   25.86
Digamma loss (Sensoy et al., 2018)   37.01   26.77
MSE loss (Sensoy et al., 2018)       13.15   15.32
E-HND                                46.18   35.31

C.3. Additional Experiment Results

IMPACT OF EVIDENTIAL LOSS

We designed the loss function to learn fine-grained evidence for known and novel classes. In this section, we study the impact of using the loss functions based on the evidential formulation provided by (Sensoy et al., 2018).
We carry out the experiments on the CUB dataset and adapt these loss functions to the setting of Hierarchical Novelty Detection:

$$L^{\log}_i(\theta) = \sum_{k=1}^{|Le(H)|} y_{ik}\left[\ln(St^{H}_i) - \ln(\alpha_{ik})\right] + \sum_{c} \sum_{k=1}^{|Le(H \setminus c)|} y_{ik}\left[\ln(St^{H \setminus c}_i) - \ln(\alpha_{ik})\right] \tag{32}$$

$$L^{\mathrm{digamma}}_i(\theta) = \sum_{k=1}^{|Le(H)|} y_{ik}\left[\psi(St^{H}_i) - \psi(\alpha_{ik})\right] + \sum_{c} \sum_{k=1}^{|Le(H \setminus c)|} y_{ik}\left[\psi(St^{H \setminus c}_i) - \psi(\alpha_{ik})\right] \tag{33}$$

$$L^{\mathrm{mse}}_i(\theta) = \sum_{k=1}^{|Le(H)|} \left[\left(y_{ik} - \frac{\alpha_{ik}}{St^{H}_i}\right)^2 + \frac{\alpha_{ik}(St^{H}_i - \alpha_{ik})}{(St^{H}_i)^2 (St^{H}_i + 1)}\right] + \sum_{c} \sum_{k=1}^{|Le(H \setminus c)|} \left[\left(y_{ik} - \frac{\alpha_{ik}}{St^{H \setminus c}_i}\right)^2 + \frac{\alpha_{ik}(St^{H \setminus c}_i - \alpha_{ik})}{(St^{H \setminus c}_i)^2 (St^{H \setminus c}_i + 1)}\right] \tag{34}$$

The strengths are calculated using the following relations:

$$St^{H}_i = \sum_{k=1}^{|Le(H)|} \alpha_{ik}, \qquad St^{H \setminus c}_i = \sum_{k=1}^{|Le(H \setminus c)|} \alpha_{ik} \tag{35}$$

The results are presented in Table 3. E-HND achieves the best performance on both NA@50 and AUC. Because the other forms of evidential loss in equations 32 to 34 do not allow the evidence for the ground-truth class to be upper bounded, these formulations cannot be used to create an evidence margin between known and novel classes. This supports the design of the E-HND loss function.

ADDITIONAL DATASETS

We conduct experiments on two additional datasets, ImageNet-1k (Deng et al., 2009) and TT100K (Zhu et al., 2016). We compare with the competitive baselines, and the results are summarized in Table 4. The hierarchy of ImageNet-1k has a depth of 14, with 1000 leaf nodes and 396 non-leaf nodes, making it the most challenging dataset for HND. The hierarchy of TT100K has a depth of 4, with 80 leaf nodes and 15 non-leaf nodes. The proposed method outperforms the competitive baselines on both datasets. It is worth noting that HCL is designed specifically for the traffic dataset, as the hierarchy of this dataset is very closely related to the visual appearances of the classes (Ruiz & Serrat, 2022) and traffic signs have relatively simple visual appearances. This leads to the superior performance of HCL on TT100K.
Since these key properties do not always hold for general datasets, HCL achieves sub-optimal performance on them.

Table 4. Comparison results on additional datasets

| Method | Imagenet-1k NA@50 | Imagenet-1k AUC | TT100K NA@50 | TT100K AUC |
| --- | --- | --- | --- | --- |
| LOO | 14.93 | 11.73 | 65.71 | 60.82 |
| E-HND | 17.14 | 12.27 | 69.76 | 60.94 |
| TD+LOO | 16.63 | 13.44 | 70.36 | 57.41 |
| TD+E-HND | 19.26 | 14.23 | 71.62 | 63.64 |

Figure 10. Comparison of the distributions of (a) logits (LOO) and (b) evidence (E-HND) for known and novel test samples in the Tiny Imagenet dataset.

Figure 11. Comparison of the distributions of (a) logits (LOO) and (b) evidence (E-HND) for known and novel test samples in the AWA2 dataset.

IMPACT OF THE EVIDENCE MARGIN

Figure 5 in the main paper shows the effect of learning an evidence margin for CUB test samples. Learning an evidence margin between known and novel classes causes the model to allocate lower evidence to novel test samples than to known test samples, whereas LOO allocates high logits to both. We plot the logit distribution of (a) LOO and the evidence distribution of (b) E-HND for the Tiny Imagenet dataset in Figure 10, the AWA2 dataset in Figure 11, and the Traffic dataset in Figure 12. The same effect is observed for known and novel test samples in these datasets. As a result, our method achieves better novelty detection performance than LOO.

COMPARISON WITH NOVELTY DETECTION METHODS

In Table 1, we use the features from one of the hierarchical classification augmented with novelty detection (HC-ND) methods as input to the LOO and E-HND methods. When features from an HC-ND method are used, performance increases on all datasets. In this section, we compare the results of (i) flatten methods (marked with +) with (ii) HC-ND methods.
For HC-ND methods, each non-leaf node is treated as a classifier. $O(n_{le})$ denotes the leaf nodes that are not descendants of the non-leaf node $n_{le}$ acting as a classifier. We describe the HC-ND methods as follows.

Figure 12. Comparison of the distributions of (a) logits (LOO) and (b) evidence (E-HND) for known and novel test samples in the Traffic dataset.

Top-Down (TD) (Lee et al., 2018): The classifiers are trained with cross-entropy loss. A regularization term induces uniform probability values for samples that do not belong to the descendants of the classifier. Following Lee et al. (2018), the loss function and confidence score comparison are defined for every non-leaf class in the hierarchy as:

$$\forall n_{le} \in NLe(H), \quad L_{TD} = \mathbb{E}_{p(x,y \mid n_{le})}\left[-\ln p(y \mid x, n_{le}; \theta_{n_{le}})\right] + \mathbb{E}_{p(x,y \mid O(n_{le}))}\, KL\left[U(\cdot \mid n_{le}) \,\|\, p(\cdot \mid x, n_{le}; \theta_{n_{le}})\right] \tag{36}$$

$$KL\left[U(\cdot \mid n_{le}) \,\|\, p(\cdot \mid x, n_{le}; \theta_{n_{le}})\right] \geq \lambda_{n_{le}} \tag{37}$$

Maximum Softmax Probability (MSP) (Vaze et al., 2021): Maximum Softmax Probability is one of the most widely used baselines in novelty detection. For the MSP baseline, we use the loss from equation 36. For the final prediction, we use the confidence score comparison:

$$\max\left(\Pr(\cdot \mid x, n_{le}; \theta_{n_{le}})\right) \geq \lambda_{n_{le}} \tag{38}$$

HC-ND with $|Ch(n_{le})| + 1$ class (Neal et al., 2018): This baseline denotes the novel class by $K + 1$ for novelty detection in multi-class classification with $K$ classes. Following this method, we modify the HC-ND method so that each non-leaf class has an extra novel class to classify. Instead of using a regularization term that induces a uniform probability distribution, data samples from $O(n_{le})$ are used as samples of the novel $(|Ch(n_{le})| + 1)$-th class. The resulting baseline is threshold free and contains $|NLe(H)|$ novel nodes.
We define the loss function and confidence score comparison as:

$$\forall n_{le} \in NLe(H), \quad L_{|Ch(n_{le})|+1} = \mathbb{E}_{p(x,y \mid n_{le})}\left[-\ln p(y \mid x, n_{le}; \theta_{n_{le}})\right] + \mathbb{E}_{p(x,y \mid O(n_{le}))}\left[-\ln p(N(n_{le}) \mid x, n_{le}; \theta_{n_{le}})\right] \tag{39}$$

$$\max\left(\Pr(\cdot \mid x, n_{le}; \theta_{n_{le}})\right) \geq p(N(n_{le}) \mid x, n_{le}; \theta_{n_{le}}) \tag{40}$$

Evidential Uncertainty (Sensoy et al., 2018): We use the uncertainty mass from evidential theory as an uncertainty measure. We use the log loss (Sensoy et al., 2018) to train each classifier to predict the correct class and to output low uncertainty for known samples. To train the classifier to output high uncertainty for novel samples, we define a regularization term using the KL divergence between the model Dirichlet distribution and the uniform Dirichlet distribution:

$$\forall n_{le} \in NLe(H), \quad L_{\text{HC-evidential}} = \mathbb{E}_{p(x,y \mid n_{le})} \sum_{k=1}^{|Ch(n_{le})|} y_{ik}\left[\ln(St^{n_{le}}_i) - \ln(\alpha_{ik})\right] + \mathbb{E}_{p(x,y \mid O(n_{le}))}\, KL\left[D(\cdot \mid x, n_{le}; \theta_{n_{le}}) \,\|\, D(\cdot \mid \langle 1, 1, \ldots, 1 \rangle, n_{le})\right] \tag{41}$$

$$St^{n_{le}}_i \geq \lambda_{n_{le}} \tag{42}$$

Table 5. Comparison with the novelty detection methods for the CUB dataset

| Category | Method | Known Accuracy | Novel Accuracy | Harmonic Mean |
| --- | --- | --- | --- | --- |
| Hierarchical | TD | 45.94 | 24.69 | 32.12 |
| Hierarchical | Max Softmax Probability | 46.34 | 25.65 | 33.02 |
| Hierarchical | HC-ND with $|Ch(n_{le})| + 1$ class | 46.83 | 26.02 | 33.45 |
| Hierarchical | Energy Score | 47.69 | 30.56 | 37.25 |
| Hierarchical | Max Logit Score | 47.80 | 30.76 | 37.43 |
| Hierarchical | Evidential uncertainty | 47.37 | 27.50 | 34.80 |
| Flatten | Relabel+ | 50.00 | 38.23 | 43.33 |
| Flatten | LOO+ | 50.00 | 42.25 | 45.80 |
| Flatten | E-HND+ | 50.00 | 46.18 | 48.01 |

Energy Score (Liu et al., 2020): We use the loss function in equation 41 to train the model. Let $f_k(x; \theta_{n_{le}})$ denote the logit for the $k$-th child of non-leaf node $n_{le}$. The uncertainty comparison is given by:

$$\sum_{k=1}^{|Ch(n_{le})|} e^{f_k(x; \theta_{n_{le}})} \geq \lambda_{n_{le}} \tag{43}$$

Maximum Logit Score (Vaze et al., 2021): We use the loss function in equation 41 to train the model. The uncertainty comparison is given by:

$$\max_k\left(f_k(x; \theta_{n_{le}})\right) \geq \lambda_{n_{le}} \tag{44}$$

For prediction, a sample is classified by each classifier in turn, starting from the top of the hierarchy and moving toward the leaf nodes.
Each classifier computes a confidence or uncertainty score that indicates how confident or uncertain it is about its prediction. If the confidence is greater than a threshold (or the uncertainty is smaller than a threshold), the predicted class is used as the next classifier, and this continues until we reach the final prediction. The thresholds are chosen on the validation set to maximize the harmonic mean between known and novel accuracy. Since the validation set does not contain real novel samples, novel samples are defined as samples from leaf nodes that are not descendants of the classifier. For the flatten methods (+), namely Relabel, LOO, and our method (E-HND), we use the result where the known accuracy is fixed at 50%. We also report the harmonic mean between known accuracy and novel accuracy for each method.

We report the results of the baselines and our method (E-HND) on the CUB dataset in Table 5. Compared with the TD method, using the maximum softmax probability score improves accuracy on both known and novel classes. Similarly, other novelty scores such as the energy score, the maximum logit score, and evidential uncertainty improve performance on both known and novel classes. However, these methods are still outperformed by all the compared flatten baselines: Relabel, LOO, and E-HND. One possible reason is that HC-ND methods need to set a total of $|NLe(H)|$ thresholds, one per classifier. Even when the need for thresholds is eliminated, as in HC-ND with $|Ch(n_{le})| + 1$ class, the result does not improve much. All HC-ND methods share a common top-down inference mechanism that can cause errors to accumulate, which is not the case for flatten methods. The comparison shows that the common baselines for novelty detection do not yield the best performance in the setting of hierarchical novelty detection.
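The top-down prediction procedure and the harmonic-mean metric described above can be sketched as follows. The tree encoding (a `children` dict), the `classify` callable, the root name `Sign`, and all scores and thresholds are hypothetical stand-ins for trained classifiers and validation-selected $\lambda$ values. Only the control flow and the harmonic-mean formula follow the text.

```python
def harmonic_mean(known_acc, novel_acc):
    """Harmonic mean of known and novel accuracy (both in percent)."""
    if known_acc + novel_acc == 0:
        return 0.0
    return 2.0 * known_acc * novel_acc / (known_acc + novel_acc)

def top_down_predict(x, root, children, classify, thresholds):
    """Descend the hierarchy until a leaf is reached or confidence drops.

    children:   dict mapping each non-leaf node to its list of child nodes
    classify:   hypothetical callable (x, node) -> (best_child, confidence)
    thresholds: dict mapping each non-leaf node to its threshold lambda
    Returns (node, is_novel): a leaf if the sample is predicted as known,
    otherwise the parent node under which the sample is declared novel.
    """
    node = root
    while node in children:  # non-leaf nodes act as classifiers
        best_child, confidence = classify(x, node)
        if confidence < thresholds[node]:
            return node, True  # novel class attached to this parent
        node = best_child
    return node, False

# Toy hierarchy loosely modelled on Figure 1; all scores are made up.
children = {"Sign": ["Regulatory", "Recreation"],
            "Regulatory": ["Stop", "Yield"],
            "Recreation": ["Hiking", "Picnic"]}
thresholds = {"Sign": 0.5, "Regulatory": 0.6, "Recreation": 0.6}
toy_scores = {"Sign": ("Regulatory", 0.9), "Regulatory": ("Stop", 0.4)}

def toy_classify(x, node):
    """Return a fixed (child, confidence) pair for the toy example."""
    return toy_scores[node]

prediction = top_down_predict(None, "Sign", children, toy_classify, thresholds)
hm_ehnd = round(harmonic_mean(50.00, 46.18), 2)  # E-HND+ row of Table 5
hm_td = round(harmonic_mean(45.94, 24.69), 2)    # TD row of Table 5
```

In the toy run the sample passes the root classifier but falls below the threshold at `Regulatory`, so it is reported as novel under that parent. The two harmonic means reproduce the corresponding entries of Table 5 (48.01 and 32.12). A mistake at any level is propagated downward, which illustrates the error accumulation noted above.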
EXPERIMENTS ON FAR-OOD

Novelty (or OOD) detection can be broadly categorized into far-OOD and near-OOD detection. Most existing works focus on the former category, where novel class samples have very different semantics from known class samples. There has been increasing interest in the latter category (Ren et al., 2021; Fort et al., 2021; Fang et al., 2022), where novel samples share some semantics with the known class samples, making the problem much more challenging. HND can be regarded as one type of near-OOD detection: by leveraging the existing hierarchical relationship among the known classes, we can perform effective fine-grained OOD detection to identify novel data samples that are semantically similar to the known classes (i.e., their siblings). As mentioned in the introduction, most real-world objects can be described using a hierarchical structure based on their relationship with other relevant objects. Furthermore, since the proposed method performs evidential learning, a sample that lies outside the entire hierarchy, which is a far-OOD situation (e.g., an object that is not a Traffic Sign in Figure 1), can be detected by checking the predicted evidence over all the nodes within the hierarchy. If low evidence is assigned to all the nodes, the model recognizes the object as not being part of the hierarchy.

Figure 13. Evidence distribution (known vs. far-OOD).

In this set of experiments, we apply the proposed method to the far-OOD problem. We study the evidence distribution of the proposed method on known and OOD datasets in Figure 13. We select 150 classes of CUB as the known classes and CIFAR-10 as the far-OOD dataset. Samples from known classes fall in the high-evidence region, and samples from the OOD dataset fall in the low-evidence region.
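The far-OOD check described here can be sketched as a simple rule over the predicted evidence vector. The function name, the example values, and the threshold are illustrative assumptions. This section does not state how a cut-off would be chosen.

```python
def is_far_ood(evidence, threshold):
    """Flag a sample as far-OOD when every node receives low evidence.

    evidence:  per-node evidence values predicted for one sample
    threshold: hypothetical cut-off, e.g. tuned on held-out data
    """
    return max(evidence) < threshold

known_like = [0.3, 21.5, 0.1, 0.8]   # one node receives high evidence
far_ood_like = [0.4, 0.2, 0.6, 0.1]  # all nodes receive low evidence

flags = [is_far_ood(e, threshold=5.0) for e in (known_like, far_ood_like)]
```

Using the maximum over nodes encodes the condition that low evidence is assigned to all nodes. A rule based on total evidence would be an alternative reading of the same idea.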
As far-OOD classes are semantically far away from the known classes and from their ancestors in the hierarchy, the separation in evidence is more prominent for far-OOD classes. This confirms that the proposed method is also effective in the far-OOD situation.

Figure 14. Comparison of the distributions of (a) logits from TD+LOO and (b) evidence from TD+E-HND for Acadian Flycatcher and Old World Flycatcher.

Figure 15. Qualitative study for representative test samples: (a) prediction for a novel sample from Regulatory no-buses; (b) prediction for a known sample from Regulatory no-bicycles. In panel (a), E-HND and TD+E-HND predict Regulatory no-something, while LOO and TD+LOO predict Regulatory no-bicycles. In panel (b), LOO and TD+LOO predict Regulatory no-something, while E-HND and TD+E-HND predict Regulatory no-bicycles.

C.4. Qualitative Study

We present the qualitative analysis of representative samples from the CUB dataset in Figure 9, along with the distributions of logits and evidence in Figure 14 for the predictions of TD+LOO and the proposed method. In this section, we present the predictions of TD+E-HND and some of the baselines for representative samples from the Traffic dataset in Figure 15. The trend is similar to the one seen on the CUB dataset. Baselines such as LOO and TD+LOO confuse novel samples with known samples and can predict a novel sample from the Regulatory no-buses sign as Regulatory no-bicycles. Similarly, they mistake a known sample from Regulatory no-bicycles for its parent class, Regulatory no-something.

D. Limitations

While the proposed work generalizes to any domain in which a hierarchy can be constructed over the training classes, the relationship between training classes may take a different form, e.g., a graph.
It would be interesting to explore novelty detection under other forms of relationship between novel samples and the training data. In addition, the loss function has two hyperparameters, $\beta_1$ and $\beta_2$, which could be viewed as a limitation. A general recommendation for setting these hyperparameters is presented in Section 4.5.

E. Source Code

The source code can be accessed here: https://github.com/ritmininglab/EHND