# iterated_belief_change_as_learning__2dcc1c33.pdf Iterated Belief Change as Learning Nicolas Schwind1 , Katsumi Inoue2 , Sébastien Konieczny3 and Pierre Marquis3,4 1National Institute of Advanced Industrial Science and Technology, Tokyo, Japan 2National Institute of Informatics, Tokyo, Japan 3Univ. Artois, CNRS, CRIL, Lens, France 4Institut Universitaire de France nicolas-schwind@aist.go.jp, inoue@nii.ac.jp, konieczny@cril.fr, marquis@cril.fr In this work, we show how the class of improvement operators a general class of iterated belief change operators can be used to define a learning model. Focusing on binary classification, we present learning and inference algorithms suited to this learning model and we evaluate them empirically. Our findings highlight two key insights: first, that iterated belief change can be viewed as an effective form of online learning, and second, that the well-established axiomatic foundations of belief change operators offer a promising avenue for the axiomatic study of classification tasks. 1 Introduction Belief Change Theory [Alchourrón et al., 1985; Gärdenfors, 1988; Katsuno and Mendelzon, 1991] provides a principled framework for modifying an agent s current beliefs in response to new information. Iterated belief revision [Darwiche and Pearl, 1997; Jin and Thielscher, 2007; Booth and Meyer, 2006] extends this framework to accommodate sequences of new information, addressing the challenge of revising an agent s beliefs over time. In both cases, the ultimate goal is to improve the agent s beliefs to better reflect the real world. While the methodologies of these two approaches differ, their objective aligns with that of Machine Learning (ML): deriving an accurate approximation of the real world from data. Despite this conceptual similarity, connections between Belief Change Theory and ML remain largely unexplored, apart from a few notable contributions in philosophical logic [Kelly, 1998; Kelly, 2014; Baltag et al., 2011; Baltag et al., 2019], inductive logic programming [Wrobel, 1994; Pagnucco and Rajaratnam, 2005], and computational learning theory [Goldsmith et al., 2004]. A major difference is that primacy of update, which requires fully adopting new information after each revision, is a key principle in belief revision. However, this principle is incompatible with typical ML scenarios involving noisy data, as it leads to substantial changes in the agent s epistemic state at each learning step. Improvement operators [Konieczny and Pino Pérez, 2008; Konieczny et al., 2010], which generalize iterated belief revision, relax the primacy of update. Soft improvement oper- ators [Konieczny and Pino Pérez, 2008], in particular, allow incremental changes that better reflect the iterative nature of learning. When the same information is encountered again, its reliability is slightly adjusted. This mirrors the behavior of online ML methods for classification, where each labeled example causes gradual changes to the estimated probabilities of the classes. Figure 1 illustrates this analogy. Most ML methods also adjust examples similar to the labeled example, those near the observed instance. A comparable mechanism can be introduced into the iterated belief change framework: both the observed example and its neighbors can have their reliability updated through an improvement operator. This paper focuses on binary classification and shows that improvement-based models built this way deliver reasonable learning performance. We compare them to standard ML methods on benchmark datasets. Results show the improvement-based model slightly outperforms Naive Bayes and achieves better recall than most existing methods. Thus, soft improvement operators offer a promising new approach to learning from examples. Although an initial exploration, this connection is important for both KR and ML communities. It links two fundamentally different tasks belief revision and supervised learning with distinct goals and methods. From the ML side, this work opens the way to learning models that emphasize interpretability and offer strong guarantees linking the model to the data. These properties are crucial for trustworthy AI but often lacking in current ML models. If rationality principles are established, they could serve as safeguards to ensure the model evolves correctly with new evidence. The proofs and code used to retrieve datasets and conduct experiments are available in [Schwind et al., 2025]. 2 Preliminaries We assume the reader familiar with basics of ML, including standard models (see, e.g., [Shalev-Shwartz and Ben-David, 2014] for an introduction). In this work, the focus is laid on tabular datasets, where instances are represented as vectors of Boolean features, aligning with many learning methods, especially in data mining. Such a Boolean encoding aids normalization, which often improves model performance. Although features in tabular data are usually not Boolean in Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25) Currently Learned Model Labeled Instance Online ML Model Epistemic State New Evidence Soft Improvement Operator Input Input Figure 1: Analogies between classifier learning models and belief change operators. essence, they can be converted accordingly. Categorical features are transformed via standard one-hot encoding, creating a Boolean feature for each domain value. Numerical features are converted by selecting thresholds within their domain to produce Boolean features. These thresholds are chosen by analyzing the data distribution (e.g., percentiles) or by optimizing split metrics like Gini impurity or entropy, as commonly done in tree-based models such as decision trees, random forests, and boosted trees. Let L be a propositional language built from a finite set of propositional variables P, standard logical connectives, and the Boolean constants (false) and (true). A world on P is a mapping from P to {0, 1}, and Ωdenotes the set of all such worlds. A world ω satisfies a formula ϕ if ω makes ϕ true; the set of such worlds is denoted by [ϕ]. A formula is consistent if [ϕ] = , and it is complete if [ϕ] = {ω} for some ω Ω. The symbol |= denotes logical entailment, and logical equivalence: ϕ |= ψ iff [ϕ] [ψ], and ϕ ψ iff [ϕ] = [ψ]. Iterated Belief Change provides a principled framework for modeling how a rational agent s beliefs evolve with successive pieces of evidence. An agent s belief state, called an epistemic state, includes both the agent s current beliefs and conditional information that guides how beliefs should change in response to new inputs. Formally, an epistemic state can be any object Ψ, from which the agent s current beliefs are extracted via a mapping Bel, such that Bel(Ψ) L [Darwiche and Pearl, 1997]. An epistemic space is then a tuple E = E, Bel , where E is the set of all epistemic states in the space [Schwind et al., 2022]. A standard example of epistemic space is built with Ordinal Conditional Functions (OCFs) [Spohn, 1988; Williams, 1995]. An OCF κ is a mapping associating each world with a non-negative integer 1 such that κ(ω) = 0 for some world ω. Definition 1 (OCF epistemic space). The OCF epistemic space is the epistemic space Eocf = Eocf, Belocf where: Eocf is the set of all OCFs over Ω; Belocf is the mapping associating each OCF κ from Eocf with a formula ψ L such that [ψ] = {ω Ω| κ(ω) = 0}. Given an epistemic space E = E, Bel , an iterated belief change operator on E is a mapping associating each 1In the original definition OCFs are defined on ordinals [Spohn, 1988], but here, as in most cases, integers suffice. epistemic state Ψ E and each formula µ L with a new epistemic state Ψ µ E, i.e., : E L E. Improvement operators [Konieczny and Pino Pérez, 2008; Konieczny et al., 2010; Medina Grespan and Pino Pérez, 2013] generalize iterated revision operators [Darwiche and Pearl, 1997] by dropping the success postulate (R*1), which requires the revised beliefs to entail the input formula. They are defined by nine rationality principles (I1) (I9) (see [Konieczny and Pino Pérez, 2008; Konieczny et al., 2010] for details). We recall only the weak primacy of update: (I1) k N s.t. Bel(Ψ k α) |= α, where Ψ 1α = Ψ α, and if k > 1, Ψ k α = (Ψ k 1α) α. This property states that after receiving α repeatedly, it eventually becomes believed. We now introduce a simple example of an improvement operator defined over the OCF epistemic space: Definition 2 (Basic shifting operator +1). The basic shifting operator +1 on Eocf is defined for any OCF κ and formula α by κ = κ +1 α, where for each world ω Ω: κ (ω) = κ(ω) x if ω [α] κ(ω) + 1 x otherwise, where x = 0 if Bel(κ) α |= , and x = 1 otherwise. Upon receiving new input α, the improvement operator +1 adjusts κ by increasing the value of worlds ω [ α] by 1 (i.e., κ (ω) = κ(ω) + 1), while leaving κ(ω) unchanged for worlds ω [α]. A normalization step ( x) then ensures min{κ (ω) | ω Ω} = 0, as required by the definition of an OCF. This operator resembles Spohn s n-conditionalization [Spohn, 1988], except n here depends on the prior plausibility of α (shifted by 1). It also relates to the one-improvement operator in [Konieczny and Pino Pérez, 2008]. 2.1 Morphological Dilation and Erosion An essential component for defining our learning operators is the notion of formula dilation and, dually, formula erosion [Bloch and Lang, 2002]. Both rely on a neighborhood B, which is a mapping associating each world ω Ωwith a set of worlds Bω Ωsuch that (i) ω Bω, and (ii) ω Bω implies that ω Bω , for all worlds ω, ω Ω. A neighborhood B induces a dilation δB and an erosion ϵB, both mappings from formulas to formulas, defined for each ϕ L as follows: [δB(ϕ)] = {ω Ω| Bω [ϕ] = } [ϵB(ϕ)] = {ω Ω| Bω [ϕ]} For k 0, the k-dilation δk B(ϕ) and k-erosion ϵk B(ϕ) are defined inductively as follows: δ0 B(ϕ) = ϕ, and δk B(ϕ) = δB(δk 1 B (ϕ)) for k > 0. Similarly, ϵ0 B(ϕ) = ϕ, and ϵk B(ϕ) = ϵB(ϵk 1 B (ϕ)) for k > 0. Neighborhoods can be derived from various similarity or distance measures on Boolean vectors (see [Choi et al., 2009] for an overview of such measures). In the present setting, the neighborhood Bω of each world ω depends solely on ω and need not be defined uniformly over Ω. Leveraging such instance-specific neighborhoods is known to be beneficial from a learning perspective [Ye et al., 2016]. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25) 3 From Epistemic Spaces to Classifier Spaces In this section, we formalize the change operation underlying the learning process of a binary classifier, drawing on the framework of iterated belief change introduced earlier. The binary classification task consists in predicting whether a given instance is a member of some target class or concept. We assume that each instance is described by a vector of values associating each feature xi from a given finite and fixed set X with a value. We focus on binarized instance descriptions, so that the feature set X consists of a set of binary features. An instance can be associated with an output (1 or 0), characterizing whether the instance is predicted as positive or negative. We represent each binary feature as a propositional variable, so that the feature set X corresponds to the set PX = {x1, . . . , xn}. A world over PX is called an instance description. Let ΩX be the set of all such instance descriptions, i.e., all worlds over PX, and let LX denote the propositional language generated from PX using the standard connectives. Each formula ϕ LX can represent a concept predicted by a binary classifier: an instance description ωX ΩX satisfies ϕ if and only if it is classified as positive. This modeling approach follows that of [Schwind et al., 2023]. To represent labeled data, as in training examples, we extend PX with an additional output (class) variable y indicating whether an instance is positive (1) or negative (0). Formally, we define P = PX {y}, with y / PX. A world ω over P is called a labeled instance, and Ωdenotes the set of all such labeled instances. A labeled instance ω is positive if ω(y) = 1 and negative if ω(y) = 0. For any labeled instance ω Ω, its feature description (i.e., the corresponding world over PX) is denoted ωX. The complete formula ϕω L such that [ϕω] = {ω} is called a training instance, with ω its associated labeled instance. Let Lc denote the set of all such training instances. A dataset D is defined as a finite sequence of training instances: D = (ϕs ω)1 s m, and D denotes the set of all possible datasets. To model binary classifiers, we move beyond treating a classifier as a static propositional formula (as in [Schwind et al., 2023]) and instead adopt a dynamic perspective that captures the learning process itself. This shift mirrors the move from one-step belief change where an agent s epistemic state is represented by a single propositional formula [Katsuno and Mendelzon, 1991] to iterated belief change, where epistemic states have a richer structure and yield formulas via the mapping Bel [Darwiche and Pearl, 1997]. Similarly, modeling the evolution of a classifier during learning requires a more expressive framework. To address this, the notion of an epistemic state is lifted to that of a binary classifier state. Likewise, the mapping Bel, which assigns each epistemic state with a propositional formula representing the agent s beliefs, is lifted to a mapping Pos, which assigns each classifier state with a formula characterizing the concept it predicts. The pair consisting of a classifier state and the mapping Pos thus forms a classifier space, in direct analogy with an epistemic space. The distinct notation Pos highlights the shift in perspective: from beliefs held by an agent to predictions made by a classifier. Formally: Definition 3 (Classifier space). A classifier space is a tuple E = E, Pos , where E is a set of binary classifiers and Pos is a mapping from E to LX. Thus, the formula Pos(Ψ) represents the concept predicted by Ψ: the set [Pos(Ψ)] contains the instance descriptions that Ψ classifies as positive, while [ Pos(Ψ)] contains those classified as negative. Note that Pos(Ψ) is not required to be a consistent formula, i.e., [Pos(Ψ)] may be empty. Standard epistemic spaces used for defining improvement operators can be directly adapted to build classifier spaces. Given an epistemic space E = E, Bel , one can construct a classifier space E = E, Pos by simply setting Pos = Bel. However, more meaningful examples of classifier spaces will be introduced in the next section. Based on classifier spaces, we define an (online) learning operator: Definition 4 (Learning operator). Let E = E, Pos be a classifier space. A learning operator on E is a mapping : E Lc E. Thus, specifies how a classifier Ψ E changes upon receiving a training instance ϕω Lc, yielding a new classifier Ψ ϕω. The framework is general enough to capture any binary classifier and online learning process, assuming binarized training data. We now have the tools to define what we call a learning framework, representing a full learning process: Definition 5 (Learning framework). A learning framework is a pair (Ψ , ), where Ψ E is a binary classifier called the anchor, and is a learning operator on E = E, Pos . The anchor Ψ serves as the initial classifier state, providing the starting point for learning. Thus, a learning framework is fully determined by two components: the learning operator , which governs how classifiers evolve, and the anchor Ψ , which sets the initial state. Given a learning framework (Ψ , ) and a training dataset D D, the resulting learned classifier, denoted Ψ D, is defined inductively by:2 Ψ = Ψ Ψ (D (ϕω)) = (Ψ D) ϕω The choice of anchor depends on the context: it may be a classifier pre-trained on prior data, an untrained classifier (e.g., initialized as trivially positive, Pos(Ψ ) , or negative, Pos(Ψ ) ), or a random element from E. This formalization captures the iterative nature of online binary classifier learning and leads into the next section, where we introduce a specific class of learning frameworks. 4 Improvement-Based Learning In this section, we introduce a concrete class of improvementbased learning operators. These operators are defined on a classifier space that extends the OCF epistemic space. In this space, each binary classifier is represented as a tuple (D, κ, τ), consisting of a training dataset D, an OCF κ, and a threshold τ R. Such classifiers are called TOCFS: 2 denotes vector concatenation. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25) Definition 6 (TOCF classifier space). The TOCF classifier space is the tuple Etocf = Etocf, Postocf , where: Etocf = D Eocf R is the set of all TOCFS Postocf is the mapping associating each TOCF (D, κ, τ) Etocf with a formula ψ LX such that [ψ] = {ω ΩX | κ(ω) τ}. In this setting, a binary classifier is thus a TOCF Ψ = (D, κ, τ). The dataset D represents the training data used to construct Ψ. The OCF κ assigns a value to each instance description ωX ΩX, with lower values indicating greater plausibility that the instance is classified as positive by Ψ. The threshold τ specifies the classification boundary: an instance ωX is classified as positive (i.e., ωX [Pos(Ψ)]) if κ(ωX) τ, and as negative otherwise. Our concrete class of learning operators is defined on the TOCF classifier space. Every operator in this class is specified by the following components: an improvement operator on the OCF epistemic space, a neighborhood B on ΩX (see Sec. 2.1), used to characterize formula dilation and erosion, and a performance metric m : N4 R. In a nutshell, the improvement operator and the neighborhood B are used together to adjust the plausibility of worlds in the underlying OCF κ of a TOCF Ψ = (D, κ, τ), while the performance metric m sets the threshold τ that best separates predicted positive and negative instances. Given an improvement operator , a neighborhood B, and a performance metric m, we denote the corresponding learning operator by ( ,B,m). Then, for any TOCF Ψ = (D, κ, τ) and any training instance ϕω, the resulting TOCF Ψ = (D , κ , τ ) = Ψ ( ,B,m) ϕω is defined as follows. First, we set D = D (ϕω), that is, we augment the current training dataset by including the new instance ϕω. Second, we build the new OCF κ using the improvement operator and the neighborhood B. This construction proceeds differently depending on whether the training instance ϕω is labeled as positive or negative. Learning from a positive instance. If the training instance ϕω satisfies ϕω |= y, the process unfolds as follows. Let ϕωX Lc X be the formula describing the instance (recall that ωX denotes its feature description). We begin by applying the improvement operator to ϕωX in κ, yielding κ ϕωX. We then continue this process iteratively on successive dilations of ϕωX: first on δB(ϕωX), then on δ2 B(ϕωX), and so on, until reaching a fixed point. Semantically, this process increases the plausibility of ω in κ, followed by an increase in the plausibility of its neighborhood, and then of neighborhoods of neighborhoods, so that the increase in plausibility assigned to a world becomes smaller as its distance from ω grows. Worlds that are not reachable from ω (that is, those not included in any dilation δk B(ϕωX) for any k) remain unaffected; their plausibility in κ stays the same. Formally, when ϕω |= y, we define the resulting OCF as κ = κ + ϕωX = κ n + ϕωX, where κ n + ϕωX is inductively defined as follows: κ 0 + ϕωX = κ, κ k+1 + ϕωX = (κ k + ϕωX) δk B(ϕωX) for k 0, n = min({k N | [δn+1 B (ϕωX)] {[δn B(ϕωX)], ΩX}}) Learning from a negative instance. Conversely, when the training instance is negative (ϕω |= y), the process targets ϕωX and its successive erosions: ϵB( ϕωX), ϵ2 B( ϕωX), and so on. At each step, plausibility is increased for the worlds satisfying the current erosion, continuing until reaching a fixed point (or stopping just before if the fixed point is an inconsistent formula). Semantically, this increases the plausibility of worlds that are farther from ωX under B, with the increase being greater the farther the world is. The world ωX itself remains unaffected. Formally, we define κ = κ ϕωX = κ n ϕωX, where κ n ϕωX is defined inductively as follows: κ 0 ϕωX = κ, κ k+1 ϕωX = (κ k ϕωX) ϵk B( ϕωX) for k 0, n = min({k N | [ϵn+1 B ( ϕωX)] {[ϵn B( ϕωX)], }}) An alternative would be to handle negative instances using a decrement operator applied to ϕωX and its successive dilations, mirroring the treatment of positive instances. While such decrement operators have been proposed [Sauerwald and Beierle, 2019], we do not adopt them here because, unlike the well-established duality between erosion and dilation, no formal duality exists between improvement and decrement operators. By relying on erosion and dilation dual operations defined over a single neighborhood B we maintain a unified learning mechanism using the same improvement operator for both positive and negative instances. Given the distinction outlined above between learning from positive and negative instances, the new OCF is defined as: κ = κ ϕω = κ + ϕωX if ϕω |= y, κ ϕωX if ϕω |= y Third, we define the threshold τ , which acts as the optimal separator. Recall that τ partitions the space of instance descriptions such that instances with plausibility less than or equal to τ are classified as positive, and those with plausibility greater than τ are classified as negative. According to Def. 6, this corresponds to [Pos(Ψ )] = {ωX | κ (ωX) τ }. We aim to select τ so as to optimize performance over the dataset D , using the performance metric m. For any candidate threshold τ N, we compute a confusion matrix cm(τ) = (tp, fp, tn, fn) as follows: tp = {|ωX ΩX | ϕω D , ϕω |= y, κ (ωX) τ|} fp = {|ωX ΩX | ϕω D , ϕω |= y, κ (ωX) τ|} tn = {|ωX ΩX | ϕω D , ϕω |= y, κ (ωX) > τ|} fn = {|ωX ΩX | ϕω D , ϕω |= y, κ (ωX) > τ|} The final threshold τ is selected as the median among all integer thresholds τ [0, up] having the maximum score m(cm(τ)), where up = arg max({κ (ωX) | ωX ΩX}):3 τ = median(arg max({m(cm(τ)) | τ N})) 3It suffices to search in [0, up]: thresholds τ < 0 yield the same confusion matrix as τ = 0, and τ > up as τ = up. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25) Having characterized both the new OCF κ and the threshold τ , we now summarize the complete definition of the improvement-based learning operator ( ,B,m): Definition 7 (Improvement-based learning operator ( ,B,m)). Let be an improvement operator, B be a neighborhood, and m be a performance metric. The improvement-based learning operator induced by , B, and m, and denoted by ( ,B,m), is defined on the TOCF classifier space Etocf as follows. For each binary classifier (D, κ, τ) Etocf and each training instance ϕω Lc, we define (D, κ, τ) ( ,B,m) ϕω = (D , κ , τ ), where: ( D = D (ϕω) κ = κ ϕω τ = arg max({m(cm(τ)) | τ N}) To define the full learning framework (cf. Def. 5), we must specify an anchor serving as an initial state. We choose Ψ = ( , κ , 0), where κ is the constant OCF assigning 0 to every world ωX ΩX. Hence, the full improvement-based learning framework is given by the pair (Ψ , ( ,B,m)). 5 An Instantiated Learning Operator We now present a concrete instantiation of an improvementbased learning operator, ( +1,BH,mba). This operator is fully defined by the following choices: +1, the basic shifting improvement operator (cf. Sec. 2) BH, that defines pairs of worlds as direct neighbors when they differ on at most one variable, i.e., for each ωX ΩX: BH ωX = {ω X | |{xi PX | ω X(xi) = ωX(xi)}| 1} mba, the balanced accuracy metric, defined for each confusion matrix (tp, fp, tn, fn) as: mba(tp, fp, tn, fn) = 1/2(tp/(tp + fn) + tn/(tn + fp)) The choice of balanced accuracy is illustrative rather than principled. Nonetheless, it is particularly appropriate for imbalanced datasets, where one class outweighs the other. By giving equal importance to the true positive rate and the true negative rate, balanced accuracy ensures a more equitable evaluation of classifier performance across both classes. Representation of (Ψ , ( +1,BH,mba)). We now show that, in this specific learning framework, both a training algorithm and an inference algorithm (predicting the label of any instance ωX ΩX from a trained classifier) can be designed. Both algorithms run in polynomial time with respect to the number of training instances and the number of features. Given an instance description ωX ΩX, let ωX ΩX be defined as ωX(xi) = 1 ωX(xi), for each xi PX. Let d H : ΩX ΩX N be the Hamming distance between instance descriptions, i.e., for all ωX, ω X ΩX, d H(ωX, ω X) = |{xi PX | ωX(xi) = ω X(xi)}|. We extend the Hamming distance to define a distance between any instance description ω X and dataset D as d H(ω X, ) = 0, and if D = : d H(ω X, D) = X {d H(ω X, ϕω) | ϕω D}, with d H(ω X, ϕω) = d H(ω X, ωX) if ϕω |= y d H(ω X, ωX) if ϕω |= y On the other hand, given any dataset D, let ωD X ΩX be any instance description such that for each xi PX: ωD X(xi) = 1 if D1(xi) D0(xi) 0 otherwise, where D1(xi) = |{ϕω D | ϕω |= y iff ω(xi) = 1}| and D0(xi) = m D1(xi) (recall that m is the number of training instances from D.) We can show that such an instance description ωD X is one of the closest instance description to the dataset D in terms of Hamming distance:4 Lemma 1. For each dataset D D and each instance description ωX ΩX, we have d H(ωD X, D) d H(ωX, D). Then, we can show that for each instance description ωX ΩX the value κ(ωX) can always be characterized via computations of Hamming distances to the currently trained dataset: Proposition 1. For each TOCF Ψ = (D, κ, τ) and each instance description ωX ΩX, we have that: κ(ωX) = d H(ωX, D) d H(ωD X, D) This has several important implications for our learning framework (Ψ , ( +1,BH,mba)) in terms of computational complexity. Recalling that n is the size of the feature set and m is the number of training instances from D, we have: Proposition 2. Let D D, and let Ψ ( +1,BH,mba) D = Ψ = (D, κ, τ). 1. τ can be computed in time O(n m2) 2. given (D, τ), for each ωX ΩX, deciding whether ωX [Pos(Ψ)] can be done in time O(n m) As a consequence of Proposition 2, in the learning framework (Ψ , ( +1,BH,mba)), we obtain both a learning algorithm (the computation of τ) and an inference algorithm (the prediction for any instance ωX ΩX given (D, τ)) that run in polynomial time with respect to the number of training instances and the number of features (Proposition 2.1 and 2.2, respectively). Since inference can be performed in polynomial time from (D, τ) alone, the learned classifier (D, κ, τ) is fully characterized by (D, τ), i.e., the OCF κ is not needed. Another noteworthy property of this learning framework is its robustness to dataset permutations. Given any dataset D = (ϕs ω)1 s m, let Dπ = (ϕπ(s) ω )1 s m denote the dataset obtained by applying a permutation π : {1, . . . , m} {1, . . . , m}: Proposition 3. For any permutation π : {1, . . . , m} {1, . . . , m} and any dataset D D, Ψ ( +1,BH,mba) D = Ψ ( +1,BH,mba) Dπ 4All proofs are available online [Schwind et al., 2025]. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25) Figure 2: A comparison of performance between all learning models across all datasets. In other words, the learned classifier is invariant under reordering of the training instances. This is significant, as many standard learning models, such as decision trees, random forests, and k-nearest neighbors (k-NN), may produce different outputs depending on the order in which the data is processed. This corresponds to the commutativity postulate for improvement operators [Schwind and Konieczny, 2020], which states that no piece of information should be treated as more important than another when the order of input carries no meaningful priority. 6 Experiments Empirical protocol. We implemented in Python the improvement-based learning model ( +1,BH,mba) (simply denoted by onward). Its predictive performance was compared against the following standard ML models (learned using the scikit-learn library [Pedregosa et al., 2011] and considering default parameters): Logistic Regression, Naive Bayes, Decision Tree, Gradient Boosting, KNN, Neural Network, Random Forest, and SVM. In addition, a trivial learning model (named Maj) always predicting the majority class (positive or negative) in the training dataset was considered as a baseline. The evaluation considered a large set of performance metrics that are standard in ML: balanced accuracy, accuracy, F1 score, Jaccard index, Matthews correlation coefficient (MCC), precision, recall, and ROC-AUC (receiver operating characteristic area under the curve). The experimental protocol involved selecting 58 binary classification datasets from the UCI repository,5 with each dataset containing up to 12,684 instances and up to 1,203 numerical or categorical features. A 10-fold cross-validation has been conducted: each dataset was split into ten random samplings, with a 90%/10% division for training and test sets. Missing values in the data were imputed by filling each numerical feature with the mean value from the dataset and each 5https://archive.ics.uci.edu/datasets/ categorical feature with the most frequent value. The datasets were then standardized using a standard scaler. For our learning framework , numerical attributes were further re-scaled linearly to the interval [0, 10] with integer values, while categorical attributes were left unchanged. For these experiments, binarizing the features was not mandatory: instead, we used a modified weighted Hamming distance between instance descriptions d H(ωX, ω X). This was made only for convenience as this is equivalent in terms of performance and computational complexity to binarizing all datasets in such a way to ensure that d H = d H. Empirical results. We did an extensive comparative analysis which is visualized through various plots (the full set is available in [Schwind et al., 2025]). Box plots were generated for each metric, showing model performance across the 58 datasets, scatter plots compared the proposed model with each baseline, and spider plots have been used to compare the performance of all models for individual datasets and in an aggregate view. We present some of them hereafter. The spider plots in Figure 2 show in a synthetic way the average predictive performances (over the 58 datasets) of the ten ML models learned for each of the ten families of models considered in our experiments and assessed for each of the eight metrics (our learning framework is simply named OCF in the figures). One can see that outperforms the baseline Maj model (which is mandatory to be considered as a significant learning operator). It turns out that in practice achieves performances that are similar to the ones of Naive Bayes. Finally, performs slightly better than the other models when recall is used to measure the predictive performance. This is an important point since high recall is expected in applications where missing a positive case can be costly, such as in healthcare or when dealing with attack detection. Focusing on the balanced accuracy metric, Figure 3 (left) presents a scatter plot for contrasting in a more precise way the predictive performances of and Naive Bayes; and Figure 3 (right) presents boxplots representing the distributions of predictive performance achieved over all datasets by each Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25) Figure 3: A comparison between our learning model and the Naive Bayes learning model across all datasets (left), and a comparison of performance between all learning models across all datasets (right), both focusing on the balanced accuracy performance metric. Figure 4: Examples of TOCFS (OCF and threshold) showing how positive and negative test instances are separated, on samplings randomly chosen from four datasets (from left to right: 73-Mushroom, 327-Phishing Websites, 545-Rice, 603-In-Vehicle Coupon Recommendation). of the ten ML models at hand. The results presented are coherent with those shown in Figure 2 (especially, the averaged balanced accuracy of is slightly greater than the one of Naive Bayes and much better than the one of Maj). Additionally, Figure 4 showcases some learned classifers (D, κ, τ) for some test sets in some randomly chosen dataset samplings. Each figure depicts the value distribution of all κ(ωX) for each test instance ωX from the sampling, represented as a blue (resp. red) dot when ϕωX |= y (resp. ϕωX |= y), and the green horizontal line corresponds to the threshold τ. It can be observed that the learned model separates quite accurately the positive instances from the negative ones for the first three samples, but clearly not for the last one (603-In-Vehicle Coupon Recommendation). 7 Conclusion In this paper, we showed how improvement operators, grounded in Belief Change Theory, can give rise to a family of online learning models, and we have presented learning and inference algorithms designed for this family. We empiri- cally evaluated the predictive performance of a specific model within this family using a range of datasets. While the results indicate that this simple model does not match the accuracy of more advanced models (e.g., neural nets), it performs comparably to all other methods in terms of recall. This makes it particularly suitable for applications where failing to detect positives is critical. Additionally, the model outperforms the trivial Majority class model and shows performance on par with Naive Bayes, suggesting a significant potential. This work opens up several avenues for future research. Theoretically, the next step is to analyze the properties of learning operators derived from improvement operators. Empirically, we plan to explore and evaluate additional models within the proposed family. For instance, while the learning algorithm used in the experiments is based on Hamming distance, other distances, such as the Mahalanobis distance, could be explored [Mahalanobis, 1936]. Because it accounts for feature correlations which could be measured on the training set, the Mahalanobis distance appears as a valuable candidate for improving model performance at inference. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25) Acknowledgements This work was supported in part by the following grants: JSPS KAKENHI Grant Numbers JP25K00375 and JP25K03190; JST CREST Grant Number JPMJCR22D3; the AI Chairs BE4mus IA (ANR-20-CHIA-0028) and EXPEKCTATION (ANR-19-CHIA-0005-01) funded by the French National Research Agency. References [Alchourrón et al., 1985] Carlos E. Alchourrón, Peter Gärdenfors, and David Makinson. On the logic of theory change: partial meet contraction and revision functions. Journal of Symbolic Logic, 50:510 530, 1985. [Baltag et al., 2011] Alexandru Baltag, Nina Gierasimczuk, and Sonja Smets. Belief revision as a truth-tracking process. In Proceedings of the 13th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 11), pages 187 190, 2011. [Baltag et al., 2019] Alexandru Baltag, Nina Gierasimczuk, and Sonja Smets. Truth-tracking by belief revision. Studia Logica, 107(5):917 947, 2019. [Bloch and Lang, 2002] Isabelle Bloch and Jérôme Lang. Towards Mathematical Morpho-Logics, pages 367 380. Physica-Verlag HD, 2002. [Booth and Meyer, 2006] Richard Booth and Thomas A. Meyer. Admissible and restrained revision. Journal of Artificial Intelligence Research, 26:127 151, 2006. [Choi et al., 2009] Seung-Seok Choi, Sung-Hyuk Cha, and Charles Tappert. A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1):43 48, 11 2009. [Darwiche and Pearl, 1997] Adnan Darwiche and Judea Pearl. On the logic of iterated belief revision. Artificial Intelligence, 89(1-2):1 29, 1997. [Gärdenfors, 1988] Peter Gärdenfors. Knowledge in flux. MIT Press, 1988. [Goldsmith et al., 2004] Judy Goldsmith, Robert H. Sloan, Balázs Szörényi, and György Turán. Theory revision with queries: Horn, read-once, and parity formulas. Artificial Intelligence, 156(2):139 176, 2004. [Jin and Thielscher, 2007] Yi Jin and Michael Thielscher. Iterated belief revision, revised. Artificial Intelligence, 171(1):1 18, 2007. [Katsuno and Mendelzon, 1991] Hirofumi Katsuno and Alberto O. Mendelzon. Propositional knowledge base revision and minimal change. Artificial Intelligence, 52:263 294, 1991. [Kelly, 1998] Kevin T. Kelly. The learning power of belief revision. In Proceedings of the 7th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 98), pages 111 124, 1998. [Kelly, 2014] Kevin T. Kelly. A Computational Learning Semantics for Inductive Empirical Knowledge, pages 289 337. Springer International Publishing, 2014. [Konieczny and Pino Pérez, 2008] Sébastien Konieczny and Ramon Pino Pérez. Improvement operators. In Proceedings of the 11th International Conference on Principles of Knowledge Representation and Reasoning (KR 08), pages 177 187, 2008. [Konieczny et al., 2010] Sébastien Konieczny, Mattia Medina Grespan, and Ramon Pino Pérez. Taxonomy of improvement operators and the problem of minimal change. In Proceedings of the 12th International Conference on Principles of Knowledge Representation and Reasoning (KR 10), pages 161 170, 2010. [Mahalanobis, 1936] Prasanta Chandra Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India, 2:49 55, 1936. [Medina Grespan and Pino Pérez, 2013] Mattia Medina Grespan and Ramón Pino Pérez. Representation of basic improvement operators. In Trends in Belief Revision and Argumentation Dynamics, pages 195 227. College Publications, 2013. [Pagnucco and Rajaratnam, 2005] Maurice Pagnucco and David Rajaratnam. Inverse resolution as belief change. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 05), pages 540 545, 2005. [Pedregosa et al., 2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Mathieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. [Sauerwald and Beierle, 2019] Kai Sauerwald and Christoph Beierle. Decrement operators in belief change. In Proceedings of the 15th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU 19), pages 251 262, 2019. [Schwind and Konieczny, 2020] Nicolas Schwind and Sébastien Konieczny. Non-prioritized iterated revision: Improvement via incremental belief merging. In Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning (KR 20), pages 738 747, 2020. [Schwind et al., 2022] Nicolas Schwind, Sébastien Konieczny, and Ramón Pino Pérez. On the representation of Darwiche and Pearl s epistemic states for iterated belief revision. In Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning (KR 22), 2022. [Schwind et al., 2023] Nicolas Schwind, Katsumi Inoue, and Pierre Marquis. Editing boolean classifiers: A belief change perspective. In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI 23), pages 6516 6524, 2023. [Schwind et al., 2025] Nicolas Schwind, Katsumi Inoue, Sébastien Konieczny, and Pierre Marquis. Iterated belief Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25) change as learning: Supplementary material. https: //github.com/nicolas-schwind/Iterated Belief Change ML, 2025. [Shalev-Shwartz and Ben-David, 2014] Shai Shalev Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. [Spohn, 1988] Wolfgang Spohn. Ordinal Conditional Functions: A Dynamic Theory of Epistemic States, pages 105 134. Springer Netherlands, 1988. [Williams, 1995] Mary-Anne Williams. Iterated theory base change: A computational model. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 95), pages 1541 1549, 1995. [Wrobel, 1994] Stefan Wrobel. Concept formation and knowledge revision. Kluwer Academic, Netherlands, 1994. [Ye et al., 2016] Han-Jia Ye, De-Chuan Zhan, and Yuan Jiang. Instance specific metric subspace learning: A bayesian approach. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI 16), pages 2272 2278, 2016. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25)