# energycompress_a_general_case_base_learning_strategy__de7e0b0c.pdf

Energy Compress: A General Case Base Learning Strategy

Fadi Badra1 , Esteban Marquer2 , Marie-Jeanne Lesot3 , Miguel Couceiro4 and David Leake5

1Universit e Sorbonne Paris Nord, Sorbonne Universit e, INSERM, Limics, 93000, Bobigny, France 2CRIL CNRS, Universit e d Artois, France 3Sorbonne Universit e, CNRS, LIP6, F-75005 Paris, France 4IST, University of Lisbon, INESC-ID, Lisbon, Portugal 5Luddy School, Indiana Unversity, Bloomington, IN, USA badra@sorbonne-paris-nord.fr, esteban.marquer@gmail.com, Marie-Jeanne.Lesot@lip6.fr, miguel.couceiro@inesc-id.pt, leake@iu.edu

Case-based prediction (CBP) methods do not learn a model of the target decision function but instead perform an inference process that depends on two similarity measures and a reference case base. This paper proposes a strategy, called Energy Compress, to learn an effective case base by selecting relevant cases from an initial set. Use of Energy Compress decreases CBP inference time, through case base compression, and also increases prediction performance, for a wide variety of CBP algorithms. Energy Compress relies on a general formulation of the CBP task in the framework of energy-based models, which leads to a new and valuable characterization of the notion of competence in case-based reasoning, in particular at the source case level. Extensive experimental results on 18 benchmarks comparing Energy Compress to 5 reference algorithms for case base maintenance support the benefit of the proposed strategy.

1 Introduction

Case-based prediction (CBP) methods, such as the k nearest neighbor (k-NN) classifier, the AP-classifier [Bounhas et al., 2017] or Co AT [Badra, 2020], are kernel-based learning methods, insofar as they crucially depend on the choice of a similarity or distance function that takes a case base as a parameter. These methods do not learn a model of the entire decision function prior to prediction, but instead infer the decision function value for a new case by direct comparison with similar cases retrieved from the case base. This inference process depends on three parameters θ = (σS, σR, CB): the similarity measures in the input (resp. output) space σS, (resp. σR), and the case base CB. The σR measure is usually fixed depending on the task at hand (e.g. classification or regression), but σS and CB have huge ranges of possibilities. Traditionally, knowledge-based methods were used to derive similarity measures for a given case base [Kolodner and Leake, 1996]. In the machine learning community, the metric learning task [Ghojogh et al., 2022] addresses a similar issue, although in a different context, and recent case-based

Co AT Before

Acc. = .95 |CB0| = 150

Acc. = .99 |CBf | = 23

Ct Co AT Before

Acc. = .97 |CB0| = 150

Acc. = 1.00 |CBf | = 23

k NN Before

Acc. = .96 |CB0| = 150

Acc. = .99 |CBf | = 14

SVM RBF Before

Acc. = .97 |CB0| = 150

Acc. = .97 |CBf | = 117

Figure 1: Sculpting the decision frontier by removing cases using Energy Compress: improving both accuracy and inference time through case base compression (Iris dataset, 2D PCA projection).

reasoning (CBR) research applies machine learning methods to learning similarity [Mathisen et al., 2019]. This paper proposes taking the reverse approach: rather than assuming the case base is given and generating a similarity measure for it, it introduces a method, Energy Compress, that learns a case base by starting from a candidate case base and selecting an appropriate subset for a given similarity measure. This can be seen as relating to Richter s [2003] observation that CBR involves multiple knowledge containers, including similarity and case knowledge, interacting such that one can compensate for the other, e.g., here, that a well-chosen case base can compensate for a suboptimal similarity measure. The case base selection task has been studied extensively in case based reasoning research on case base maintenance by compression, which selects subsets of an original case base to retain (e.g. [Juarez et al., 2018], see more details in Sec. 2). However, such work usually makes the implicit assumption that a bigger is better for prediction accuracy: research on case-base compression generally aims at other benefits, such

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

as increased efficiency. In contrast, Energy Compress enables sculpting the decision frontier by removing cases to increase accuracy, in addition to compressing to reduce the inference time. Thus, in contrast to most existing approaches, it does not sacrifice one for the other. This is illustrated in Fig. 1, whick shows the decision regions obtained for the Iris dataset, on a 2D PCA projection, for three CBP algorithms (Co AT, Ct Co AT and 3-NN, see their descriptions in Sec. 4, based on the Euclidean distance) and SVM with a radial basis function (width = 6.4 10 2). Both rows show the prediction regions and their associated accuracy computed on the whole case base, CB0. For each CBP algorithm, the left figure shows the results when the inference algorithm uses the whole case base CB0 while, on the right figure, it only uses CBf, extracted from CB0 by Energy Compress. Here Energy Compress leads to a different decision frontier that achieves a higher accuracy on CB0, for all 4 algorithms, with smaller case bases CBf, (size reduction up to 90% for the 3-NN) and thus decreases inference time. The second main characteristic of Energy Compress is that it is a general method: it can be applied to virtually any CBP algorithm, provided that the algorithm assigns a probability1 to each potential outcome, to provide a tailored case base. Defining a single case base learning strategy applicable to different prediction algorithms remedies one of the main limitations of existing case base learning algorithms, which are dedicated to specific approaches (often to k-NN, see Sec. 2, or to Co AT [Marquer et al., 2023] ) . Energy Compress relies on an original formulation of the CBP task in the framework of energy-based models [Le Cun et al., 2006]. This formulation offers a new and valuable characterization of the notion of competence, in particular at the source case level. The case base learning method then consists of removing cases from a candidate set so as to learn the underlying energy function. This constitutes a generalization of the approach proposed by [Marquer et al., 2023] for the specific case of the Co AT algorithm. The contributions of the paper thus include a general case base learning strategy, that can be applied to a large variety of case-based prediction algorithms and that enables both decreasing inference time and increasing prediction performance, without sacrificing one for the other. Extensive experiments run on 18 UCI datasets demonstrate that Energy Compress obtains very competitive compression ratios while leading to an accuracy (given in %) increase on average of +7.4 for k-NN, +13.3 for Co AT, +26.5 for Ct Co AT, and +3.7 for SVM RBF. Although we focus in this paper on the selection of instances to include in the case base, the experimental results also support that the indicators we define lead to a valuable characterization of the competence notion (see Sec. 3.2 for a discussion). The paper is organized as follows. Sec. 2 discusses related work, both in the domains of case-based prediction and energy-based machine learning. Sec. 3 describes the proposed Energy Compress strategy, discussing the energy-based

1If a CBP algorithm assigns a support to each potential outcome instead of a probability, a probability can be computed as the ratio of the support for each outcome compared to the total support.

formulation of CBP and presenting the induced general competence model as well as the proposed case base learning strategy. Sec. 4 shows how this general case base learning strategy can be implemented for different CBP algorithms. Sec. 5 presents the experimental study of the approach. Finally, Sec. 6 concludes the paper and discusses future work.

2 Related Work This section discusses the position of the Energy Compress method with respect to related works, both with respect to case-based prediction and energy-based learning. Case-based prediction. Case-based reasoning (CBR) relates to the cognitive ability of analogical reasoning and models a special kind of plausible inference principle, according to which if two situations are judged similar, then it is plausible that they will also have similar outcomes, described in CBR by the phrase similar problems have similar solutions [Anthony and Ratsaby, 2015; Leake and Wilson, 1999, inter alia]. Case-based prediction (CBP) algorithms implement such a transfer step for prediction tasks, which corresponds to applying analogical reasoning to classification or regression tasks in machine learning. In CBP methods, analogical transfer is used to complete the description of a new case by direct comparison with similar cases. A recent review of the literature on case-based prediction [Badra and Lesot, 2023] shows that all surveyed implementations of the inference process share a common principle: they optimize a transfer of similarity knowledge, from the situation space to the outcome space. The predicted outcome is the one that optimizes a measure of compatibility between two similarity measures on a case base, following the plausible inference principle stated above. The compatibility measure can take the form of a joint similarity measure, a set of continuity constraints, a set of rules or a global indicator. In this paper, we focus on the approaches that optimize a global indicator computed on the whole case base, showing in Sec. 3.1 that their inference process can then be interpreted as an energy-based inference. The notations used throughout the paper are as follows: S denotes the input space, and R the output space, respectively equipped with the similarity measures σS and σR. An element of S is called a situation, and an element of R an outcome, or a result. A set CB = {(s1, r1), . . . , (sn, rn)} of elements in S R is called a case base. An element ci = (si, ri) CB is called a source case. Let Tref S R be a set of cases called a reference set, and ct = (st, rt) Tref be a reference case. A potential case (st, ˆr) can be constructed from a reference case by associating to the same situation st S, a different outcome ˆr R, ˆr = rt. Let us denote by Predθ a CBP algorithm and θ = (σS, σR, CB) its parameters. The predicted outcome r = Predθ(st) for a new situation st is the one that makes the new similarity relations σR(ri, r ) most compatible with the new similarity relations σS(si, st). Case base maintenance. Research in case-base maintenance has a long history (see [Juarez et al., 2018] for a survey). The origins date back to early work on case selection in the context of the condensed nearest neighbor al-

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

gorithm (CNNR [Hart, 1968]), which aimed at selecting a subset of the initial set of cases still sufficient for k-NN to generate a correct categorization. This process is commonly referred to as compressing the case base. Focuses of case-base maintenance include increasing prediction accuracy, e.g., by removing noisy cases, and removing unnecessary cases to increase the efficiency of source case retrieval [Cummins, 2013]. Much research concerns competence models and competence-based deletion, which prioritizes removal of cases that make the least contribution to overall coverage, considering both their coverage (cases for which they can provide a solution) and reachability (whether they could be solved using other cases) [Smyth and Mc Kenna, 2001]. The aim is to compress the case base while minimizing competence loss; the gold standard for performance is the full case base. These methods depend on the representativeness assumption that existing cases are a good proxy for future cases, to assess their contributions; the proposed Energy Compress method can accommodate this assumption, but does not depend on it2. Besides pioneering algorithms such as RENN [Tomek, 1976] or IB3 [Aha et al., 1991], a recent literature review on instance selection for automatic text classification [Cunha et al., 2023] shows that recent density-based algorithms such as LSSm [Leyva et al., 2015] or XLDIS [Carbonera, 2017] are also effective in retaining accuracy while reducing the training set size and the training process time by discarding noisy or redundant instances.

Relationship to the knowledge containers perspective in CBR. The knowledge of CBR systems is often viewed in terms of a set of four knowledge containers [Richter, 2003]: the case base, the case vocabulary,3 similarity knowledge and case adaptation knowledge, with strengths in one able to compensate for weaknesses in others. For example, having a better case base, with better coverage, may compensate for weak case adaptation knowledge. However, the value of knowledge cannot be judged in isolation; the harmonization of components such as similarity and adaptation knowledge strongly affects performance [Leake and Ye, 2021]. Much research focuses on acquiring similarity knowledge for a given case base (e.g.,[Mathisen et al., 2019]). The work in this paper investigates the contrasting approach of starting from a given similarity measure and selecting a good case base for it.

Energy-based models. This section briefly summarizes the basics of the energy-based framework the paper uses to provide an original solution to the above-mentioned case-base maintenance issues. Inspired by statistical physics, energybased models [Le Cun et al., 2006] specify a probability distribution p(x; θ) = e Eθ(x)/ R e Eθ(x)dx via a parameterized scalar-valued function Eθ(x) called an energy function. In its conditional version, the function Eθ : X Y R

2In Sec. 3.2 we define the competence used by Energy Compress, measured on a set Tref of cases distinct from the case base CB. Under the representativeness assumption, we can have Tref = CB as CB is representative of the distribution of cases. 3Richter defines the vocabulary of a knowledge representation system [as] which data structures and which elements of the structures are used to represent primitive notions. Generally (and here) the attribute-value representation is used, typically in tabular form.

associates each pair (x, y) X Y with a scalar value Eθ(x, y) that represents the compatibility between the input x and the output y under the set of parameters θ. The energy function Eθ takes low values when y is compatible with x, and higher values when y and x are less compatible. The goal of the energy-based inference is to find, among a set of outputs Y, the output y Y that minimizes the value of the energy function: y = arg miny Y Eθ(x, y). Given a family of energy functions Eθ(x, y) indexed by a set of parameters θ, the goal of learning the energy function is to optimize the θ parameters in order to push down (i.e., assign lower energy values to) the points on the energy surface that are around the training samples, and to pull up all other points. Contrastive divergence [Hinton et al., 2006] is a common learning strategy that, given a numerical hyperparameter λ, optimizes a contrastive loss function such as the hinge loss, which is defined, for a training sample (xk, yk) and a generated out of distribution sample (xk, ˆy) by: ℓ(θ, xk, yk, ˆy) = max(0, λ + Eθ(xk, yk) Eθ(xk, ˆy)). The hinge loss associates a loss value to a training sample (xk, yk) whenever its energy is not lower by at least a margin λ than the energy of the incorrect sample (xk, ˆy). In CBR terms, the inputs X correspond to the situations S, and the outputs Y to the outcomes R. If we associate CBP algorithms with an energy function Eθ (we explain how in Sec. 4), the hinge loss associated with a case (sk, rk) and Eθ can be measured. This serves as the foundation for the competence measure we propose in Sec. 3.2.

3 Energy Compress In order to solve the problem of learning an appropriate case base for a given CBP algorithm with chosen similarity measures σS, σR, the proposed Energy Compress strategy relies on an energy-based model of CBP. The general model is described in Sec. 3.1 below, its implementation for four reference specific algorithms is detailed in Sec. 4. This section then describes the induced competence model, which leads to the proposed case base learning strategy.

3.1 An Energy-Based Model A core idea proposed in [Marquer et al., 2023] is that an inference process that optimizes a global indicator of compatibility between two similarity measures on a case base can be interpreted as an energy-based inference. The input space X is the situation space S and the output space Y is the outcome space R. The energy function EP red θ : S R R measures, for Predθ, the compatibility of the outcome similarities with the added situation similarities when a potential new case ˆct = (st, ˆr) is added to the case base. We propose to extend this definition to compute the energy of a set A S R of cases by taking

EP red θ (A) = X

ct=(st,rt) A EP red θ (st, rt).

The goal of the energy-based inference is to find, among the set of potential outcomes ˆr R, the outcome that minimizes the energy function, i.e.,

r = Predθ(st) = arg min ˆr R EP red θ (st, ˆr). (1)

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

The energy function EP red θ is learned by minimizing on a reference set Tref the loss L = P ct=(st,rt) Tref ℓ(θ, ct), where ℓis defined by:

ℓ(θ, ct) = max(0, λ + EP red θ (st, rt) min ˆr =rt EP red θ (st, ˆr)).

3.2 A General Competence Model

The notions of competence of a case or of a case base, and of the influence of a case on the prediction, can be captured in an energy-based framework. Let us denote by D S R a set of instances, and (x, y) D an instance of D. Let Predθ with θ = (σS, σR, CB) be a CBP algorithm that can be expressed as an energy-based inference, i.e., that satisfies Eq. 1 for some energy function EP red θ . The competence of a case base CB is measured w. r. t. a given reference set Tref by taking the (opposite of) the hinge loss on Tref:

C(θ, Tref) = 1 |Tref|

ct Tref ℓ(θ, ct). (2)

The competence of a source case c = (s, r) CB w. r. t. a reference set Tref can be defined as the loss of competence that would happen if this source case was deleted from CB:

C(c, θ, Tref) = C(θ, Tref) C((σS, σR, CB \ {c}), Tref).

This definition also allows defining a notion of competence locally as the contribution of c on each individual reference case ct Tref of the reference set:

influenceθ(c, ct) = ℓ(θ, ct) ℓ((σS, σR, CB \ {c}), ct).

This notion of influence of a case on the prediction can be seen as a continuous counterpart of the popular notion of case coverage introduced in [Smyth and Mc Kenna, 2001] for k-NN. The influence of a case is computed here with respect to a reference set Tref, and not to the case base CB. The case influences are aggregated to measure the competence of a case base on the whole reference set, as well as the competence of each source case. In this paper, we focus on the application of this competence model to learn the most competent case base for a given prediction algorithm Predθ.

3.3 The Case Base Learning Strategy

The case base learning strategy consists in iteratively deleting the least competent source case from a candidate set CB0. As the competence of a case base is defined directly from the loss function of the underlying energy-based model (Eq. 2), selecting the most competent case base enables learning the energy function EP red θ by optimizing the CB parameter. The Energy Compress case deletion procedure is described by Algorithm 1: at each iteration, the source case that contributes least to the competence of the case base CB w.r.t. the reference set Tref is deleted from the case base (line 11). In this algorithm, the returned parameter θf = (σS, σR, CBf) is the one that maximizes accuracy (function Acc, line 6) on the reference set Tref. The computational complexity of Energy Compress is in O(|CB| |Tref| |R|) O(E), where E is the computational complexity of the energy computation.

Algorithm 1 Energy Compress case deletion procedure. C(c, θ, Tref) depends on Pred and the associated EP red θ .

1: function ENERGY COMPRESS(σS, σR, CB0, Tref) 2: CB = CB0 3: max acc = 1 4: repeat 5: θ = (σS, σR, CB) 6: acc = Acc(Predθ(Tref)) 7: if acc max acc then 8: max acc = acc 9: θf = θ 10: end if 11: CB = CB \ {arg minc CB C(c, θ, Tref)} 12: until |CB| = 0 13: return θf 14: end function

4 Application to Various CBP Algorithms

For Energy Compress to be applicable to a given CBP algorithm Predθ, the only requirement is that Predθ can be expressed as an energy-based inference, i.e., as an algorithm that predicts the outcome that minimizes an energy function EP red θ (Eq. 1). We show in this section that each of the four CBP algorithms Co AT, Ct Co AT, k-NN, and the AP-Classifier can be expressed as minimizing an energy function. Tab. 1 gives examples of such energy functions.

Co AT. The Co AT case-based prediction algorithm was introduced in [Badra, 2020] and formulated in an energy-based model in [Marquer et al., 2023]. Its energy function ECo AT θ , reported in Tab. 1, simply counts the number of new triplets that are not ordered in the same way by σS and by σR.

Ct Co AT. The Ct Co AT algorithm is a variant of the Co AT case-based prediction algorithm that we introduce in this paper. Its energy function ECt Co AT θ is a continuous version of the Co AT energy function where in a classification setting, each positive triplet (i.e., a triplet (i, j, k) such that σR(ri, rj) = σR(ri, rk)) contributes 1 2 and each negative triplet (such that σR(ri, rj) = σR(ri, rk)) contributes to a value between 0 (easy negatives) and 1 (hard negatives). When σR(ri, rj) < σR(ri, rk), if σS(si, sj) σS(si, sk) (hard negative), then 1 2(1 [σS(si, sj) σS(si, sk)][σR(ri, rj) σR(ri, rk)]) [ 1

2, 1] and the triplet greatly increases the energy, as in ECo AT θ under the same conditions. If σS(si, sj) < σS(si, sk) then the contribution of (i, j, k) is in [0, 1

2[ (easy negative), and the triplet has a low contribution to the energy.

k nearest neighbors (k-NN) alg. The k-NN decision function can be written as r = arg maxˆr R χθ(st, ˆr), where

χθ(st, ˆr) = X

(si,ri) CB 1k σS(st, )(si) σR(ri, ˆr)

is a joint similarity measure that measures the compatibility of σS with σR on CB. In this expression, 1k σS(si, )(s) returns 1 if s belongs to the set of k nearest neighbors of st. The energy function Ek NN θ measures the incompatibility of σS

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Prediction algorithm (Predθ) Energy function EP red θ (st, ˆr)

Co AT ECo AT θ (st, ˆr) = Γ(σS, σR, CB {(st, ˆr)}) Γ(σS, σR, CB), where Γ(σS, σR, CB) =| {((s0, r0), (si, ri), (sj, rj)) CB3 such that σS(s0, si) σS(s0, sj) and σR(r0, ri) < σR(r0, rj)} |

Ct Co AT ECt Co AT θ (st, ˆr) = ΓCt(σS, σR, CB {(st, ˆr)}) ΓCt(σS, σR, CB), where ΓCt(σS, σR, CB) = 1

2 P i,j,k 1 [σS(si, sj) σS(si, sk)][σR(ri, rj) σR(ri, rk)]

k-NN Ek NN θ (st, ˆr) = 1 1 |CB| X

(si,ri) CB 1k σS(st, )(si) σR(ri, ˆr)

AP-Classifier EAP θ (st, ˆr) = 1 1 |CB| X

(si,ri) CB σS(st, si) σR(ri, ˆr)

Probabilistic model: arg maxˆr R P(r|st) EP θ (st, ˆr) = 1 P(ˆr|st)

Table 1: Examples of energy functions that can be learned by Energy Compress.

with σR on CB according to χθ: its value is 1 if σS and σR are incompatible, and then decreases as χθ increases.

Analogical proportion-based classifiers. Analogical proportion-based classifiers (see e.g. [Bounhas and Prade, 2024; Couceiro et al., 2017; Couceiro et al., 2018]) can be modeled as special CBP algorithms that reason on similar differences between instances [Badra and Lesot, 2022]. The case base CB = {ci = (si, ri)} is constructed from all pairs (a, b) of instances of a set of instances D, i.e., ci = (si, ri) = (a b, f(a) f(b)) where f(a) is the outcome of a. A new instance (x, ˆy) added to D leads to a set of |D| potential new cases ˆct = (st, ˆr) with ˆct = (st, ˆr) = (c x, f(c) ˆy), formed by taking all possible triples (a, b, c) from D. The similarity measures σS and σR are chosen such that σS(a b, c x) = 1 iff a : b :: c : x holds, and σR(f(a) f(b), f(c) ˆy) = 1 iff f(a) : f(b) :: f(c) : ˆy holds. The transfer strategy is rule-based, and consists in predicting for a new x the value y that maximizes the number of times that the rule (σS 1) (σR = 1) can be triggered to derive y . This means that the prediction y is the one that maximizes a compatibility measure χθ(st, ˆr) that counts for each potential outcome ˆr the number of new pairs of cases (ci, ˆct) for which σR(ri, ˆr) = 1 and σS(si, st) 1. The AP-Classifier [Bounhas et al., 2017] implements this transfer strategy by adding up, for each potential new case ˆct = (st, ˆr), the similarity values σS(si, st) for all source cases ci = (si, ri) such that σR(ri, ˆr) = 1. The obtained energy function EAP θ resembles the one defined for k-NN (see Tab. 1), but needs to be aggregated on the |D| new cases ˆct resulting from the addition of the instance (x, ˆy). The decision function of the AP-Classifier can be written as:

y = arg min ˆy EAP θ ({ˆct = (c x, f(c) ˆy), c D}).

Although these algorithms also maximize a global compatibility measure χθ and can be interpreted as implementing an energy-based inference, they are not included in the following experiments, due to their high computational cost.

Extension to any probability-based machine learning model. At a very general level, predictive models that provide probabilities for each possible outcome can be interpreted as energy-based models. For our purposes, given a conditional probability model P(ˆr|s) that predicts r = arg maxˆr R P(ˆr|st), we define the energy function of the model as EP θ (st, ˆr) = 1 P(ˆr|st). This definition follows the intuitions of energy-based models: more desirable outcomes are identified by higher probabilities and lower energies.

Support Vector Machines (SVMs). SVMs can be seen as a specific case of the previous probability-based machine learning model. To do so, their posterior can be approximated by Platt scaling [Platt, 1999].

5 Experiments

This section describes the experiments run to validate that (i) Energy Compress substantially increases the performance of CBP algorithms for a fixed similarity measure σS, (ii) the obtained compression ratio is competitive w.r.t. the state of the art, (iii) the method is general, in that it both allows capturing competence for a variety of CBP algorithms and learning a case base tailored for each prediction algorithm4

5.1 Protocol The case base learning methods CNNR, ENN, XLDIS, LSSm, IB3, and Energy Compress were tested on 18 UCI datasets against the prediction algorithms Co AT, Ct Co AT, k NN (with k = 7), as well as SVMs with linear, polynomial, and radial basis function kernels. The 18 datasets, listed in Tab. 2, are attribute-value datasets with 3 to 56 attributes and 2 to 7 classes, used in a classification setting. For each classification task, the initial parameters θ0 = (σS, σR, CB0) are chosen as follows. The outcome space R is the set of class labels ri, and σR is the class membership similarity measure, such that σR(ri, rj) = 1 if ri = rj, and 0 otherwise. For Co AT, Ct Co AT, and k-NN, the

4Code for reproducing the experiments is available at: https:// github.com/EMarquer/Me ATCube/tree/maintenance benchmark

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Dataset # instances # attributes # classes Balance Scale 625 4 3 Br. Can. Wisc. Diag. 569 30 2 Br. Can. Wisc. Prog. 198 33 2 Credit Approval 690 15 2 Dermatology 366 34 6 Glass Identification 214 9 7 Haberman s Survival 306 3 2 Heart Disease 303 14 5 Hepatitis 155 19 2 Ionosphere 351 33 2 Iris 150 4 3 Pima 768 8 2 Teach. Ass. Eval. 151 5 3 Lenses 24 4 3 Liver Disorders 345 6 2 Lung Cancer 32 56 3 Wine 178 13 3 Zoo 101 16 7

Table 2: The 18 UCI datasets used in the experiments.

similarity measure σS is chosen to be the decreasing function σS(si, sj) = e d(si,sj) of the Euclidean distance d. The SVMs use the implementation and default parameters from scikit-learn5 to fit the kernel and to estimate the probabilities. All pairs (Learn, Pred) are formed, where Learn {CNNR, ENN, . . .} is a case base learning method and Pred {Co AT, Ct Co AT,. . .} is a CBP algorithm. For each pair (Learn, Pred), the learning algorithm Learn is applied to learn CBf from the candidate base CB0, with a hinge margin λ = 0.1. The similarity measures σS and σR remain fixed in the process, only the case base CBf is learned. We apply 10-fold cross validation with stratified splitting. For each dataset D, each fold is constructed by sampling three distinct subsets CB0, Tref, and Ttest from D. The candidate base CB0 serves as the initial case base. The reference set Tref is used by Energy Compress to learn CBf from CB0. The test set Ttest is not used for learning, but only to measure the accuracy of each CBP algorithm before and after learning took place. The sizes for these sets are taken to be |Tref| = |Ttest| = min(100, .2 |D|), and |CB0| = min(50, |D| (|Tref| + |Ttest|)). Two quality criteria are considered. The accuracy increase Acc(Predθf (Ttest)) Acc(Predθ0(Ttest)) measures the difference between the initial accuracy Acc(Predθ0(Ttest)) and the accuracy Acc(Predθf (Ttest)) computed with the final parameters θf = (σS, σR, CBf). The relative size |CBf |

|CB0| of the learned case base is the ratio between the size of the learned case base CBf and the size of the initial case base CB0. In order to study the extent to which Energy Compress provides case bases that are tailored for the prediction algorithm using them, we apply the following protocol: for each pair (pred1, pred2) Pred2 (i) we learn a case base using Energy Compress on pred1 and Tref, and (ii) we use the learned case base to train pred2 and observe its performance on Ttest. If the performance with the pair (pred1, pred2) (pred1 = pred2) is significantly lower than with the pairs

5https://scikit-learn.org/stable/modules/generated/sklearn.svm. SVC.html

Learn Pred Acc. increase (%) |CBf | |CB0|

+6.3 7.9 53.6% 12.7 ENN +5.4 9.1 85.7% 18.4 XLDIS +3.8 5.3 62.6% 21.5 LSSm +3.6 6.0 89.5% 13.6 IB3 +1.6 1.8 75.7% 13.8 EC +13.3 11.4 43.7% 13.4

+12.3 17.0 43.9% 24.3 ENN +4.2 6.2 90.3% 19.4 XLDIS +7.8 12.6 69.0% 30.3 LSSm +2.4 4.1 92.9% 12.6 IB3 +1.6 2.1 87.0% 11.1 EC +26.5 18.4 72.7% 17.3

+2.1 5.7 60.2% 12.1 ENN +1.8 4.0 90.3% 15.5 XLDIS +1.4 4.1 86.7% 26.4 LSSm +1.3 1.9 92.7% 10.4 IB3 +1.3 2.4 85.8% 15.6 EC +7.4 4.7 75.0% 15.9

+3.1 3.6 63.0% 9.5 ENN +2.2 2.3 90.2% 10.5 XLDIS +2.0 3.8 78.5% 18.6 LSSm +2.5 3.2 90.0% 11.4 IB3 +2.1 3.0 79.9% 9.7 EC +5.3 7.1 75.8% 17.4

+2.0 5.1 66.2% 9.1 ENN +1.0 1.2 89.7% 8.0 XLDIS +1.4 2.4 77.5% 19.4 LSSm +1.6 3.5 90.6% 12.0 IB3 +1.4 2.6 84.2% 13.1 EC +3.8 5.6 45.2% 18.0

+1.6 3.7 62.3% 10.6 ENN +0.8 1.0 93.1% 7.6 XLDIS +1.3 2.8 83.7% 22.4 LSSm +1.1 2.3 93.0% 12.0 IB3 +1.0 2.4 89.1% 13.2 EC +3.7 6.0 72.9% 17.0

Table 3: Average accuracy increase and relative case base size.

(pred1, pred1) and (pred2, pred2), the case base learned with pred1 (resp. pred2) is more suitable for prediction with pred1 (resp. pred2).

5.2 Results

Increasing performance. Energy Compress (denoted EC in the result table) always increased the performance of the CBP algorithms while substantially reducing the case base size. Tab. 3 shows the average accuracy increase and relative size after compression obtained on the 18 UCI datasets for each case base learning method and each CBP algorithm. The obtained accuracy increase was strictly positive for k-NN, Co AT, and Ct Co AT on all datasets. The average accuracy increase is of +7.4 for k-NN (min 2.5, max 22.5), +13.3 for Co AT (min 3, max 39.7), and +26.5 for Ct Co AT (min 8, max 65.8). The accuracy increase is less significant for SVMs but still always positive or null on all datasets, between 3.7% and 5.3% in average. The reason why Ct Co AT exhibits the

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

CBP algorithm Acc. before Acc. after

Co AT 65.8% 24.9 79.9% 15.6 Ct Co AT 51.5% 25.3 78.3% 15.7 k NN 70.9% 19.1 78.4% 16.0 SVM-linear 72.1% 23.4 77.4% 17.3 SVM-poly 65.3% 21.5 69.1% 16.8 SVM-rbf 72.5% 22.7 76.2% 17.6

Table 4: Average accuracy before and after Energy Compress.

highest increase is probably due to the fact that for the same similarity measure σS, the average performance of Ct Co AT is the lowest before case base learning (Tab. 4), which means that Ct Co AT starts with the highest margin of improvement. Nonetheless, after learning, the algorithm exhibits a performance comparable to an algorithm such as k-NN. With respect to its objective to increase accuracy, Energy Compress far outperforms all methods of the state of the art for k-NN, Co AT, and Ct Co AT: the accuracy increase is doubled compared to other methods on average. All accuracy increases provided in Tab. 3 and the more detailed results given in Tab. 4 are significant, as for each row (180 experiments each) the paired t-tests p-values are of the order of 10 4 or smaller, except for 5 cases where it is of the order of 10 3 (IB3 with SVM-poly, SVM-rbf, and Co AT, and XLDIS with SVM-rbf and k NN). The gain in case base size is also very competitive compared to the state of the art: on average Energy Compress gives the lowest relative case base size for Co AT and SVM-poly, and is among the three lowest case base sizes for the other algorithms, even though in these experiments, the stopping criterion for compression was chosen to optimize accuracy on the reference set, not compression. The results also demonstrate the generality of the approach. Energy Compress can be applied to a variety of prediction algorithms and allows reducing the case base size (and thus, the inference time) for all tested algorithms. The accuracy increase is less significant for SVMs, however, perhaps because SVMs are not relying that much on a complex inference process involving the case base CB at prediction time, so tuning the CB parameter may have less effect on their performance. Additionally, the success of the compression process and the increase in performance validate the proposed general competence model. Despite not using accuracy when computing the competence, for all the considered CBP approaches the cases selected for removal have a similar impact on performance as they have on competence. This follows what was observed for synthetic data with Co AT in [Marquer et al., 2023], and this alignment between predictive performance and the measure of case competence substantiates the intuitions behind using the hinge loss to measure competence. Additionally, the success of the compression process and the increase in performance validates the proposed general competence model.

Case base tailored learning. The case base learned by Energy Compress is tailored for a given CBP algorithm. Fig. 2 illustrates this idea by showing for the Iris dataset the accuracies obtained according to the model used to select the cases and the model used for prediction. The best accuracy is ob-

Co AT Ct Co AT SVM-lin. SVM-poly SVM-rbf k NN Model used for prediction

Model used to select the cases with Energy Compress

Figure 2: Mean accuracy over the 10 fold of cross validation on Iris, when predicting with a different model (horizontal axis) than the one used with Energy Compress (vertical axis).

tained on the diagonal, when these two models coincide. The performance drop between this diagonal and the other pairs is significant (according to paired Student t-tests), with, for instance, for Iris, more than 50% of pairs with p-values under 0.031, and 25% under 0.001.

6 Conclusion and Future Work

Energy Compress is a general case-base learning strategy that enables increasing the performance of a wide variety of casebased prediction algorithms while reducing the size of the case base. The obtained case base is not only smaller, but it is also tailored for the prediction algorithm for which it has been learned. The method is general: it can be applied to any case base prediction algorithm, and even to training data selection for any classifier that assigns a probability (normalized or not) to each potential outcome. The underlying energybased model can be leveraged to capture various competence notions such as the competence of a case or of a case base, or the influence of a particular case on the prediction. We believe that the obtained results (e.g.,+7.4% accuracy in average for the k-NN algorithm) constitute a breakthrough because they demonstrate that the role of the case base is not only to provide valuable knowledge on the task at hand, but a parameter that can be learned to tune the prediction algorithm to the task. These improvements were made possible by the recent progress in the modeling of the case base inference process, but work is needed to better understand this family of algorithms (e.g.,understand why the Ct Co AT algorithm relies so heavily on the quality of the case base). In this paper, σS was fixed, so the next steps will include exploring how case base compression interacts with similarity learning for CBP. Other perspectives include testing on regression scenarios, exploring alternative stopping criteria in Algorithm 1 (e.g.,to favor the compression ratio instead of accuracy), developing and enriching the competence model, running a qualitative analysis of the obtained case bases.

Acknowledgements

This work has been supported by the French National Agency for Research (ANR), namely, projects ANR-22-CE23-0002 (ERIANA), ANR-22-CE23-0023 (AT2TA), and ANR-22CE23-0032 (SMe LT).

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

[Aha et al., 1991] David W. Aha, Dennis F. Kibler, and Marc K. Albert. Instance-based learning algorithms. Machine Learning, 6:37 66, 1991.

[Anthony and Ratsaby, 2015] Martin Anthony and Joel Ratsaby. A probabilistic approach to case-based inference. Theoretical Computer Science, 589:61 75, July 2015.

[Badra and Lesot, 2022] Fadi Badra and Marie-Jeanne Lesot. Co AT-APC: When Analogical Proportion-based Classification Meets Case-Based Prediction. In Workshop on Analogies: From Theory to Applications, ATA@ICCBR, 2022.

[Badra and Lesot, 2023] Fadi Badra and Marie-Jeanne Lesot. Case-based prediction A survey. Int. Journal of Approximate Reasoning, 158, 2023.

[Badra, 2020] Fadi Badra. A Dataset Complexity Measure for Analogical Transfer. In Proc. of the Int. Joint Conf. on Artificial Intelligence, IJCAI 20, pages 1601 1607, 2020.

[Bounhas and Prade, 2024] Myriam Bounhas and Henri Prade. Revisiting analogical proportions and analogical inference. Int. Journal of Approximate Reasoning, 171, 2024.

[Bounhas et al., 2017] Myriam Bounhas, Henri Prade, and Gilles Richard. Analogy-based classifiers for nominal or numerical data. Int. Journal of Approximate Reasoning, 91:36 55, 2017.

[Carbonera, 2017] Joel Luis Carbonera. An efficient approach for instance selection. In Da Wa K, volume 10440 of Lecture Notes in Computer Science, pages 228 243. Springer, 2017.

[Couceiro et al., 2017] Miguel Couceiro, Nicolas Hug, Henri Prade, and Gilles Richard. Analogy-preserving functions: A way to extend boolean samples. In Proc. of the Int. Joint Conf. on Artificial Intelligence, IJCAI 17, pages 1575 1581, 2017.

[Couceiro et al., 2018] Miguel Couceiro, Nicolas Hug, Henri Prade, and Gilles Richard. Behavior of analogical inference w.r.t. boolean functions. In Proc. of the Int. Joint Conf. on Artificial Intelligence, IJCAI 18, pages 2057 2063, 2018.

[Cummins, 2013] Lisa Cummins. Combining and Choosing Case Base Maintenance Algorithms. Ph D thesis, University College Cork, 2013.

[Cunha et al., 2023] Washington Cunha, Felipe Viegas, Celso Franc a, Thierson Rosa, Leonardo Rocha, and Marcos Andr e Gonc alves. A comparative survey of instance selection methods applied to nonneural and transformerbased text classification. ACM Computing Surveys, 2023.

[Ghojogh et al., 2022] Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, and Mark Crowley. Spectral, probabilistic, and deep metric learning: Tutorial and survey. In ar Xiv:2201.09267, 2022.

[Hart, 1968] Peter Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14, 1968. [Hinton et al., 2006] Geoffrey Hinton, Simon Osindero, Max Welling, and Yee-Whye Teh. Unsupervised Discovery of Nonlinear Structure Using Contrastive Backpropagation. Cognitive Science, 30(4):725 731, 2006. [Juarez et al., 2018] Jose M. Juarez, Susan Craw, J. Ricardo Lopez-Delgado, and Manuel Campos. Maintenance of Case Bases: Current Algorithms after Fifty Years. In Proc. of the Int. Joint Conf. on Artificial Intelligence, IJCAI, pages 5457 5463, 2018. [Kolodner and Leake, 1996] Janet Kolodner and David Leake. A tutorial introduction to case-based reasoning. In Case-Based Reasoning: Experiences, Lessons, and Future Directions, pages 31 65. AAAI Press, 1996. [Leake and Wilson, 1999] D. Leake and D. Wilson. When experience is wrong: Examining CBR for changing tasks and environments. In Case-Based Reasoning Research and Development, ICCBR 1999, pages 218 232, Berlin, 1999. Springer. [Leake and Ye, 2021] David Leake and Xiaomeng Ye. Harmonizing case retrieval and adaptation with alternating optimization. In Case-Based Reasoning Research and Development, ICCBR 2021, pages 125 139. Springer, 2021. [Le Cun et al., 2006] Yann Le Cun, Sumit Chopra, Raia Hadsell, Marc Aurelio Ranzato, and Fu Jie Huang. A Tutorial on Energy-Based Learning. In Predicting Strutured Data, page 59. MIT Press, 2006. [Leyva et al., 2015] Enrique Leyva, Antonio Gonz alez Mu noz, and Ra ul P erez. Three new instance selection methods based on local sets: A comparative study with several approaches from a bi-objective perspective. Pattern Recognition, 48(4):1523 1537, 2015. [Marquer et al., 2023] Esteban Marquer, Fadi Badra, Marie Jeanne Lesot, Miguel Couceiro, and David Leake. Less is Better: An Energy-Based Approach to Case Base Competence. In Workshop on Analogies: From Theory to Applications, ATA@ICCBR, 2023. [Mathisen et al., 2019] Bjørn Magnus Mathisen, Agnar Aamodt, Kerstin Bach, and Helge Langse th. Learning similarity measures from data. Progress in Artificial Intelligence, 10 2019. [Platt, 1999] John Platt. Probabilities for SV Machines. In Advances in Large Margin Classifiers, volume 10 (3), pages 61 74, January 1999. [Richter, 2003] Michael M Richter. Knowledge Containers. In Readings in Case-Based Reasoning, 2003. [Smyth and Mc Kenna, 2001] Barry Smyth and Elizabeth Mc Kenna. Competence Models and the Maintenance Problem. Computational Intelligence, 17(2):235 249, 2001. [Tomek, 1976] Ivan Tomek. An Experiment with the Edited Nearest-Neighbor Rule. IEEE Transactions on Systems, Man, and Cybernetics, SMC-6(6):448 452, 1976.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)