# iterated_belief_change_as_learning__2dcc1c33.pdf

Iterated Belief Change as Learning

Nicolas Schwind1 , Katsumi Inoue2 , Sébastien Konieczny3 and Pierre Marquis3,4

1National Institute of Advanced Industrial Science and Technology, Tokyo, Japan 2National Institute of Informatics, Tokyo, Japan 3Univ. Artois, CNRS, CRIL, Lens, France 4Institut Universitaire de France nicolas-schwind@aist.go.jp, inoue@nii.ac.jp, konieczny@cril.fr, marquis@cril.fr

In this work, we show how the class of improvement operators a general class of iterated belief change operators can be used to deﬁne a learning model. Focusing on binary classiﬁcation, we present learning and inference algorithms suited to this learning model and we evaluate them empirically. Our ﬁndings highlight two key insights: ﬁrst, that iterated belief change can be viewed as an effective form of online learning, and second, that the well-established axiomatic foundations of belief change operators offer a promising avenue for the axiomatic study of classiﬁcation tasks.

1 Introduction Belief Change Theory [Alchourrón et al., 1985; Gärdenfors, 1988; Katsuno and Mendelzon, 1991] provides a principled framework for modifying an agent s current beliefs in response to new information. Iterated belief revision [Darwiche and Pearl, 1997; Jin and Thielscher, 2007; Booth and Meyer, 2006] extends this framework to accommodate sequences of new information, addressing the challenge of revising an agent s beliefs over time. In both cases, the ultimate goal is to improve the agent s beliefs to better reﬂect the real world. While the methodologies of these two approaches differ, their objective aligns with that of Machine Learning (ML): deriving an accurate approximation of the real world from data. Despite this conceptual similarity, connections between Belief Change Theory and ML remain largely unexplored, apart from a few notable contributions in philosophical logic [Kelly, 1998; Kelly, 2014; Baltag et al., 2011; Baltag et al., 2019], inductive logic programming [Wrobel, 1994; Pagnucco and Rajaratnam, 2005], and computational learning theory [Goldsmith et al., 2004]. A major difference is that primacy of update, which requires fully adopting new information after each revision, is a key principle in belief revision. However, this principle is incompatible with typical ML scenarios involving noisy data, as it leads to substantial changes in the agent s epistemic state at each learning step. Improvement operators [Konieczny and Pino Pérez, 2008; Konieczny et al., 2010], which generalize iterated belief revision, relax the primacy of update. Soft improvement oper-

ators [Konieczny and Pino Pérez, 2008], in particular, allow incremental changes that better reﬂect the iterative nature of learning. When the same information is encountered again, its reliability is slightly adjusted. This mirrors the behavior of online ML methods for classiﬁcation, where each labeled example causes gradual changes to the estimated probabilities of the classes. Figure 1 illustrates this analogy. Most ML methods also adjust examples similar to the labeled example, those near the observed instance. A comparable mechanism can be introduced into the iterated belief change framework: both the observed example and its neighbors can have their reliability updated through an improvement operator. This paper focuses on binary classiﬁcation and shows that improvement-based models built this way deliver reasonable learning performance. We compare them to standard ML methods on benchmark datasets. Results show the improvement-based model slightly outperforms Naive Bayes and achieves better recall than most existing methods. Thus, soft improvement operators offer a promising new approach to learning from examples. Although an initial exploration, this connection is important for both KR and ML communities. It links two fundamentally different tasks belief revision and supervised learning with distinct goals and methods. From the ML side, this work opens the way to learning models that emphasize interpretability and offer strong guarantees linking the model to the data. These properties are crucial for trustworthy AI but often lacking in current ML models. If rationality principles are established, they could serve as safeguards to ensure the model evolves correctly with new evidence. The proofs and code used to retrieve datasets and conduct experiments are available in [Schwind et al., 2025].

2 Preliminaries

We assume the reader familiar with basics of ML, including standard models (see, e.g., [Shalev-Shwartz and Ben-David, 2014] for an introduction). In this work, the focus is laid on tabular datasets, where instances are represented as vectors of Boolean features, aligning with many learning methods, especially in data mining. Such a Boolean encoding aids normalization, which often improves model performance. Although features in tabular data are usually not Boolean in

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Currently Learned Model

Labeled Instance

Online ML Model

Epistemic State

New Evidence

Soft Improvement Operator

Input Input

Figure 1: Analogies between classiﬁer learning models and belief change operators.

essence, they can be converted accordingly. Categorical features are transformed via standard one-hot encoding, creating a Boolean feature for each domain value. Numerical features are converted by selecting thresholds within their domain to produce Boolean features. These thresholds are chosen by analyzing the data distribution (e.g., percentiles) or by optimizing split metrics like Gini impurity or entropy, as commonly done in tree-based models such as decision trees, random forests, and boosted trees. Let L be a propositional language built from a ﬁnite set of propositional variables P, standard logical connectives, and the Boolean constants (false) and (true). A world on P is a mapping from P to {0, 1}, and Ωdenotes the set of all such worlds. A world ω satisﬁes a formula ϕ if ω makes ϕ true; the set of such worlds is denoted by [ϕ]. A formula is consistent if [ϕ] = , and it is complete if [ϕ] = {ω} for some ω Ω. The symbol |= denotes logical entailment, and logical equivalence: ϕ |= ψ iff [ϕ] [ψ], and ϕ ψ iff [ϕ] = [ψ]. Iterated Belief Change provides a principled framework for modeling how a rational agent s beliefs evolve with successive pieces of evidence. An agent s belief state, called an epistemic state, includes both the agent s current beliefs and conditional information that guides how beliefs should change in response to new inputs. Formally, an epistemic state can be any object Ψ, from which the agent s current beliefs are extracted via a mapping Bel, such that Bel(Ψ) L [Darwiche and Pearl, 1997]. An epistemic space is then a tuple E = E, Bel , where E is the set of all epistemic states in the space [Schwind et al., 2022]. A standard example of epistemic space is built with Ordinal Conditional Functions (OCFs) [Spohn, 1988; Williams, 1995]. An OCF κ is a mapping associating each world with a non-negative integer 1 such that κ(ω) = 0 for some world ω. Deﬁnition 1 (OCF epistemic space). The OCF epistemic space is the epistemic space Eocf = Eocf, Belocf where: Eocf is the set of all OCFs over Ω; Belocf is the mapping associating each OCF κ from Eocf with a formula ψ L such that [ψ] = {ω Ω| κ(ω) = 0}.

Given an epistemic space E = E, Bel , an iterated belief change operator on E is a mapping associating each

1In the original deﬁnition OCFs are deﬁned on ordinals [Spohn, 1988], but here, as in most cases, integers sufﬁce.

epistemic state Ψ E and each formula µ L with a new epistemic state Ψ µ E, i.e., : E L E. Improvement operators [Konieczny and Pino Pérez, 2008; Konieczny et al., 2010; Medina Grespan and Pino Pérez, 2013] generalize iterated revision operators [Darwiche and Pearl, 1997] by dropping the success postulate (R*1), which requires the revised beliefs to entail the input formula. They are deﬁned by nine rationality principles (I1) (I9) (see [Konieczny and Pino Pérez, 2008; Konieczny et al., 2010] for details). We recall only the weak primacy of update: (I1) k N s.t. Bel(Ψ k α) |= α, where Ψ 1α = Ψ α, and if k > 1, Ψ k α = (Ψ k 1α) α. This property states that after receiving α repeatedly, it eventually becomes believed. We now introduce a simple example of an improvement operator deﬁned over the OCF epistemic space: Deﬁnition 2 (Basic shifting operator +1). The basic shifting operator +1 on Eocf is deﬁned for any OCF κ and formula α by κ = κ +1 α, where for each world ω Ω:

κ (ω) = κ(ω) x if ω [α] κ(ω) + 1 x otherwise,

where x = 0 if Bel(κ) α |= , and x = 1 otherwise. Upon receiving new input α, the improvement operator +1 adjusts κ by increasing the value of worlds ω [ α] by 1 (i.e., κ (ω) = κ(ω) + 1), while leaving κ(ω) unchanged for worlds ω [α]. A normalization step ( x) then ensures min{κ (ω) | ω Ω} = 0, as required by the deﬁnition of an OCF. This operator resembles Spohn s n-conditionalization [Spohn, 1988], except n here depends on the prior plausibility of α (shifted by 1). It also relates to the one-improvement operator in [Konieczny and Pino Pérez, 2008].

2.1 Morphological Dilation and Erosion An essential component for deﬁning our learning operators is the notion of formula dilation and, dually, formula erosion [Bloch and Lang, 2002]. Both rely on a neighborhood B, which is a mapping associating each world ω Ωwith a set of worlds Bω Ωsuch that (i) ω Bω, and (ii) ω Bω implies that ω Bω , for all worlds ω, ω Ω. A neighborhood B induces a dilation δB and an erosion ϵB, both mappings from formulas to formulas, deﬁned for each ϕ L as follows: [δB(ϕ)] = {ω Ω| Bω [ϕ] = } [ϵB(ϕ)] = {ω Ω| Bω [ϕ]} For k 0, the k-dilation δk B(ϕ) and k-erosion ϵk B(ϕ) are deﬁned inductively as follows: δ0 B(ϕ) = ϕ, and δk B(ϕ) = δB(δk 1 B (ϕ)) for k > 0. Similarly, ϵ0 B(ϕ) = ϕ, and ϵk B(ϕ) = ϵB(ϵk 1 B (ϕ)) for k > 0. Neighborhoods can be derived from various similarity or distance measures on Boolean vectors (see [Choi et al., 2009] for an overview of such measures). In the present setting, the neighborhood Bω of each world ω depends solely on ω and need not be deﬁned uniformly over Ω. Leveraging such instance-speciﬁc neighborhoods is known to be beneﬁcial from a learning perspective [Ye et al., 2016].

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

3 From Epistemic Spaces to Classiﬁer Spaces

In this section, we formalize the change operation underlying the learning process of a binary classiﬁer, drawing on the framework of iterated belief change introduced earlier. The binary classiﬁcation task consists in predicting whether a given instance is a member of some target class or concept. We assume that each instance is described by a vector of values associating each feature xi from a given ﬁnite and ﬁxed set X with a value. We focus on binarized instance descriptions, so that the feature set X consists of a set of binary features. An instance can be associated with an output (1 or 0), characterizing whether the instance is predicted as positive or negative. We represent each binary feature as a propositional variable, so that the feature set X corresponds to the set PX = {x1, . . . , xn}. A world over PX is called an instance description. Let ΩX be the set of all such instance descriptions, i.e., all worlds over PX, and let LX denote the propositional language generated from PX using the standard connectives. Each formula ϕ LX can represent a concept predicted by a binary classiﬁer: an instance description ωX ΩX satisﬁes ϕ if and only if it is classiﬁed as positive. This modeling approach follows that of [Schwind et al., 2023]. To represent labeled data, as in training examples, we extend PX with an additional output (class) variable y indicating whether an instance is positive (1) or negative (0). Formally, we deﬁne P = PX {y}, with y / PX. A world ω over P is called a labeled instance, and Ωdenotes the set of all such labeled instances. A labeled instance ω is positive if ω(y) = 1 and negative if ω(y) = 0. For any labeled instance ω Ω, its feature description (i.e., the corresponding world over PX) is denoted ωX. The complete formula ϕω L such that [ϕω] = {ω} is called a training instance, with ω its associated labeled instance. Let Lc denote the set of all such training instances. A dataset D is deﬁned as a ﬁnite sequence of training instances: D = (ϕs ω)1 s m, and D denotes the set of all possible datasets.

To model binary classiﬁers, we move beyond treating a classiﬁer as a static propositional formula (as in [Schwind et al., 2023]) and instead adopt a dynamic perspective that captures the learning process itself. This shift mirrors the move from one-step belief change where an agent s epistemic state is represented by a single propositional formula [Katsuno and Mendelzon, 1991] to iterated belief change, where epistemic states have a richer structure and yield formulas via the mapping Bel [Darwiche and Pearl, 1997]. Similarly, modeling the evolution of a classiﬁer during learning requires a more expressive framework. To address this, the notion of an epistemic state is lifted to that of a binary classiﬁer state. Likewise, the mapping Bel, which assigns each epistemic state with a propositional formula representing the agent s beliefs, is lifted to a mapping Pos, which assigns each classiﬁer state with a formula characterizing the concept it predicts. The pair consisting of a classiﬁer state and the mapping Pos thus forms a classiﬁer space, in direct analogy with an epistemic space. The distinct notation Pos highlights the shift in perspective: from beliefs held by an agent to predictions made by a classiﬁer. Formally:

Deﬁnition 3 (Classiﬁer space). A classiﬁer space is a tuple E = E, Pos , where E is a set of binary classiﬁers and Pos is a mapping from E to LX.

Thus, the formula Pos(Ψ) represents the concept predicted by Ψ: the set [Pos(Ψ)] contains the instance descriptions that Ψ classiﬁes as positive, while [ Pos(Ψ)] contains those classiﬁed as negative. Note that Pos(Ψ) is not required to be a consistent formula, i.e., [Pos(Ψ)] may be empty. Standard epistemic spaces used for deﬁning improvement operators can be directly adapted to build classiﬁer spaces. Given an epistemic space E = E, Bel , one can construct a classiﬁer space E = E, Pos by simply setting Pos = Bel. However, more meaningful examples of classiﬁer spaces will be introduced in the next section. Based on classiﬁer spaces, we deﬁne an (online) learning operator:

Deﬁnition 4 (Learning operator). Let E = E, Pos be a classiﬁer space. A learning operator on E is a mapping : E Lc E.

Thus, speciﬁes how a classiﬁer Ψ E changes upon receiving a training instance ϕω Lc, yielding a new classiﬁer Ψ ϕω. The framework is general enough to capture any binary classiﬁer and online learning process, assuming binarized training data.

We now have the tools to deﬁne what we call a learning framework, representing a full learning process:

Deﬁnition 5 (Learning framework). A learning framework is a pair (Ψ , ), where Ψ E is a binary classiﬁer called the anchor, and is a learning operator on E = E, Pos .

The anchor Ψ serves as the initial classiﬁer state, providing the starting point for learning. Thus, a learning framework is fully determined by two components: the learning operator , which governs how classiﬁers evolve, and the anchor Ψ , which sets the initial state. Given a learning framework (Ψ , ) and a training dataset D D, the resulting learned classiﬁer, denoted Ψ D, is deﬁned inductively by:2

Ψ = Ψ Ψ (D (ϕω)) = (Ψ D) ϕω The choice of anchor depends on the context: it may be a classiﬁer pre-trained on prior data, an untrained classiﬁer (e.g., initialized as trivially positive, Pos(Ψ ) , or negative, Pos(Ψ ) ), or a random element from E. This formalization captures the iterative nature of online binary classiﬁer learning and leads into the next section, where we introduce a speciﬁc class of learning frameworks.

4 Improvement-Based Learning

In this section, we introduce a concrete class of improvementbased learning operators. These operators are deﬁned on a classiﬁer space that extends the OCF epistemic space. In this space, each binary classiﬁer is represented as a tuple (D, κ, τ), consisting of a training dataset D, an OCF κ, and a threshold τ R. Such classiﬁers are called TOCFS:

2 denotes vector concatenation.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Deﬁnition 6 (TOCF classiﬁer space). The TOCF classiﬁer space is the tuple Etocf = Etocf, Postocf , where: Etocf = D Eocf R is the set of all TOCFS Postocf is the mapping associating each TOCF (D, κ, τ) Etocf with a formula ψ LX such that [ψ] = {ω ΩX | κ(ω) τ}.

In this setting, a binary classiﬁer is thus a TOCF Ψ = (D, κ, τ). The dataset D represents the training data used to construct Ψ. The OCF κ assigns a value to each instance description ωX ΩX, with lower values indicating greater plausibility that the instance is classiﬁed as positive by Ψ. The threshold τ speciﬁes the classiﬁcation boundary: an instance ωX is classiﬁed as positive (i.e., ωX [Pos(Ψ)]) if κ(ωX) τ, and as negative otherwise. Our concrete class of learning operators is deﬁned on the TOCF classiﬁer space. Every operator in this class is speciﬁed by the following components: an improvement operator on the OCF epistemic space, a neighborhood B on ΩX (see Sec. 2.1), used to characterize formula dilation and erosion, and a performance metric m : N4 R. In a nutshell, the improvement operator and the neighborhood B are used together to adjust the plausibility of worlds in the underlying OCF κ of a TOCF Ψ = (D, κ, τ), while the performance metric m sets the threshold τ that best separates predicted positive and negative instances. Given an improvement operator , a neighborhood B, and a performance metric m, we denote the corresponding learning operator by ( ,B,m). Then, for any TOCF Ψ = (D, κ, τ) and any training instance ϕω, the resulting TOCF Ψ = (D , κ , τ ) = Ψ ( ,B,m) ϕω is deﬁned as follows. First, we set D = D (ϕω), that is, we augment the current training dataset by including the new instance ϕω. Second, we build the new OCF κ using the improvement operator and the neighborhood B. This construction proceeds differently depending on whether the training instance ϕω is labeled as positive or negative. Learning from a positive instance. If the training instance ϕω satisﬁes ϕω |= y, the process unfolds as follows. Let ϕωX Lc X be the formula describing the instance (recall that ωX denotes its feature description). We begin by applying the improvement operator to ϕωX in κ, yielding κ ϕωX. We then continue this process iteratively on successive dilations of ϕωX: ﬁrst on δB(ϕωX), then on δ2 B(ϕωX), and so on, until reaching a ﬁxed point. Semantically, this process increases the plausibility of ω in κ, followed by an increase in the plausibility of its neighborhood, and then of neighborhoods of neighborhoods, so that the increase in plausibility assigned to a world becomes smaller as its distance from ω grows. Worlds that are not reachable from ω (that is, those not included in any dilation δk B(ϕωX) for any k) remain unaffected; their plausibility in κ stays the same. Formally, when ϕω |= y, we deﬁne the resulting OCF as κ = κ + ϕωX = κ n + ϕωX, where κ n + ϕωX is inductively deﬁned as follows:

κ 0 + ϕωX = κ,

κ k+1 + ϕωX = (κ k + ϕωX) δk B(ϕωX) for k 0,

n = min({k N | [δn+1 B (ϕωX)] {[δn B(ϕωX)], ΩX}}) Learning from a negative instance. Conversely, when the training instance is negative (ϕω |= y), the process targets ϕωX and its successive erosions: ϵB( ϕωX), ϵ2 B( ϕωX), and so on. At each step, plausibility is increased for the worlds satisfying the current erosion, continuing until reaching a ﬁxed point (or stopping just before if the ﬁxed point is an inconsistent formula). Semantically, this increases the plausibility of worlds that are farther from ωX under B, with the increase being greater the farther the world is. The world ωX itself remains unaffected. Formally, we deﬁne κ = κ ϕωX = κ n ϕωX, where κ n ϕωX is deﬁned inductively as follows: κ 0 ϕωX = κ,

κ k+1 ϕωX = (κ k ϕωX) ϵk B( ϕωX) for k 0,

n = min({k N | [ϵn+1 B ( ϕωX)] {[ϵn B( ϕωX)], }}) An alternative would be to handle negative instances using a decrement operator applied to ϕωX and its successive dilations, mirroring the treatment of positive instances. While such decrement operators have been proposed [Sauerwald and Beierle, 2019], we do not adopt them here because, unlike the well-established duality between erosion and dilation, no formal duality exists between improvement and decrement operators. By relying on erosion and dilation dual operations deﬁned over a single neighborhood B we maintain a uniﬁed learning mechanism using the same improvement operator for both positive and negative instances. Given the distinction outlined above between learning from positive and negative instances, the new OCF is deﬁned as:

κ = κ ϕω = κ + ϕωX if ϕω |= y, κ ϕωX if ϕω |= y

Third, we deﬁne the threshold τ , which acts as the optimal separator. Recall that τ partitions the space of instance descriptions such that instances with plausibility less than or equal to τ are classiﬁed as positive, and those with plausibility greater than τ are classiﬁed as negative. According to Def. 6, this corresponds to [Pos(Ψ )] = {ωX | κ (ωX) τ }. We aim to select τ so as to optimize performance over the dataset D , using the performance metric m. For any candidate threshold τ N, we compute a confusion matrix cm(τ) = (tp, fp, tn, fn) as follows: tp = {|ωX ΩX | ϕω D , ϕω |= y, κ (ωX) τ|} fp = {|ωX ΩX | ϕω D , ϕω |= y, κ (ωX) τ|} tn = {|ωX ΩX | ϕω D , ϕω |= y, κ (ωX) > τ|} fn = {|ωX ΩX | ϕω D , ϕω |= y, κ (ωX) > τ|} The ﬁnal threshold τ is selected as the median among all integer thresholds τ [0, up] having the maximum score m(cm(τ)), where up = arg max({κ (ωX) | ωX ΩX}):3

τ = median(arg max({m(cm(τ)) | τ N}))

3It sufﬁces to search in [0, up]: thresholds τ < 0 yield the same confusion matrix as τ = 0, and τ > up as τ = up.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Having characterized both the new OCF κ and the threshold τ , we now summarize the complete deﬁnition of the improvement-based learning operator ( ,B,m): Deﬁnition 7 (Improvement-based learning operator ( ,B,m)). Let be an improvement operator, B be a neighborhood, and m be a performance metric. The improvement-based learning operator induced by , B, and m, and denoted by ( ,B,m), is deﬁned on the TOCF classiﬁer space Etocf as follows. For each binary classiﬁer (D, κ, τ) Etocf and each training instance ϕω Lc, we deﬁne (D, κ, τ) ( ,B,m) ϕω = (D , κ , τ ), where: ( D = D (ϕω) κ = κ ϕω τ = arg max({m(cm(τ)) | τ N})

To deﬁne the full learning framework (cf. Def. 5), we must specify an anchor serving as an initial state. We choose Ψ = ( , κ , 0), where κ is the constant OCF assigning 0 to every world ωX ΩX. Hence, the full improvement-based learning framework is given by the pair (Ψ , ( ,B,m)).

5 An Instantiated Learning Operator We now present a concrete instantiation of an improvementbased learning operator, ( +1,BH,mba). This operator is fully deﬁned by the following choices: +1, the basic shifting improvement operator (cf. Sec. 2) BH, that deﬁnes pairs of worlds as direct neighbors when they differ on at most one variable, i.e., for each ωX ΩX:

BH ωX = {ω X | |{xi PX | ω X(xi) = ωX(xi)}| 1}

mba, the balanced accuracy metric, deﬁned for each confusion matrix (tp, fp, tn, fn) as:

mba(tp, fp, tn, fn) = 1/2(tp/(tp + fn) + tn/(tn + fp))

The choice of balanced accuracy is illustrative rather than principled. Nonetheless, it is particularly appropriate for imbalanced datasets, where one class outweighs the other. By giving equal importance to the true positive rate and the true negative rate, balanced accuracy ensures a more equitable evaluation of classiﬁer performance across both classes. Representation of (Ψ , ( +1,BH,mba)). We now show that, in this speciﬁc learning framework, both a training algorithm and an inference algorithm (predicting the label of any instance ωX ΩX from a trained classiﬁer) can be designed. Both algorithms run in polynomial time with respect to the number of training instances and the number of features. Given an instance description ωX ΩX, let ωX ΩX be deﬁned as ωX(xi) = 1 ωX(xi), for each xi PX. Let d H : ΩX ΩX N be the Hamming distance between instance descriptions, i.e., for all ωX, ω X ΩX,

d H(ωX, ω X) = |{xi PX | ωX(xi) = ω X(xi)}|.

We extend the Hamming distance to deﬁne a distance between any instance description ω X and dataset D as d H(ω X, ) = 0, and if D = :

d H(ω X, D) = X {d H(ω X, ϕω) | ϕω D}, with

d H(ω X, ϕω) = d H(ω X, ωX) if ϕω |= y d H(ω X, ωX) if ϕω |= y

On the other hand, given any dataset D, let ωD X ΩX be any instance description such that for each xi PX:

ωD X(xi) = 1 if D1(xi) D0(xi) 0 otherwise,

where D1(xi) = |{ϕω D | ϕω |= y iff ω(xi) = 1}| and D0(xi) = m D1(xi) (recall that m is the number of training instances from D.) We can show that such an instance description ωD X is one of the closest instance description to the dataset D in terms of Hamming distance:4

Lemma 1. For each dataset D D and each instance description ωX ΩX, we have d H(ωD X, D) d H(ωX, D). Then, we can show that for each instance description ωX ΩX the value κ(ωX) can always be characterized via computations of Hamming distances to the currently trained dataset: Proposition 1. For each TOCF Ψ = (D, κ, τ) and each instance description ωX ΩX, we have that:

κ(ωX) = d H(ωX, D) d H(ωD X, D)

This has several important implications for our learning framework (Ψ , ( +1,BH,mba)) in terms of computational complexity. Recalling that n is the size of the feature set and m is the number of training instances from D, we have: Proposition 2. Let D D, and let Ψ ( +1,BH,mba) D = Ψ = (D, κ, τ). 1. τ can be computed in time O(n m2) 2. given (D, τ), for each ωX ΩX, deciding whether ωX [Pos(Ψ)] can be done in time O(n m)

As a consequence of Proposition 2, in the learning framework (Ψ , ( +1,BH,mba)), we obtain both a learning algorithm (the computation of τ) and an inference algorithm (the prediction for any instance ωX ΩX given (D, τ)) that run in polynomial time with respect to the number of training instances and the number of features (Proposition 2.1 and 2.2, respectively). Since inference can be performed in polynomial time from (D, τ) alone, the learned classiﬁer (D, κ, τ) is fully characterized by (D, τ), i.e., the OCF κ is not needed. Another noteworthy property of this learning framework is its robustness to dataset permutations. Given any dataset D = (ϕs ω)1 s m, let Dπ = (ϕπ(s) ω )1 s m denote the dataset obtained by applying a permutation π : {1, . . . , m} {1, . . . , m}: Proposition 3. For any permutation π : {1, . . . , m} {1, . . . , m} and any dataset D D,

Ψ ( +1,BH,mba) D = Ψ ( +1,BH,mba) Dπ

4All proofs are available online [Schwind et al., 2025].

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Figure 2: A comparison of performance between all learning models across all datasets.

In other words, the learned classiﬁer is invariant under reordering of the training instances. This is signiﬁcant, as many standard learning models, such as decision trees, random forests, and k-nearest neighbors (k-NN), may produce different outputs depending on the order in which the data is processed. This corresponds to the commutativity postulate for improvement operators [Schwind and Konieczny, 2020], which states that no piece of information should be treated as more important than another when the order of input carries no meaningful priority.

6 Experiments

Empirical protocol. We implemented in Python the improvement-based learning model ( +1,BH,mba) (simply denoted by onward). Its predictive performance was compared against the following standard ML models (learned using the scikit-learn library [Pedregosa et al., 2011] and considering default parameters): Logistic Regression, Naive Bayes, Decision Tree, Gradient Boosting, KNN, Neural Network, Random Forest, and SVM. In addition, a trivial learning model (named Maj) always predicting the majority class (positive or negative) in the training dataset was considered as a baseline. The evaluation considered a large set of performance metrics that are standard in ML: balanced accuracy, accuracy, F1 score, Jaccard index, Matthews correlation coefﬁcient (MCC), precision, recall, and ROC-AUC (receiver operating characteristic area under the curve). The experimental protocol involved selecting 58 binary classiﬁcation datasets from the UCI repository,5 with each dataset containing up to 12,684 instances and up to 1,203 numerical or categorical features. A 10-fold cross-validation has been conducted: each dataset was split into ten random samplings, with a 90%/10% division for training and test sets. Missing values in the data were imputed by ﬁlling each numerical feature with the mean value from the dataset and each

5https://archive.ics.uci.edu/datasets/

categorical feature with the most frequent value. The datasets were then standardized using a standard scaler. For our learning framework , numerical attributes were further re-scaled linearly to the interval [0, 10] with integer values, while categorical attributes were left unchanged. For these experiments, binarizing the features was not mandatory: instead, we used a modiﬁed weighted Hamming distance between instance descriptions d H(ωX, ω X). This was made only for convenience as this is equivalent in terms of performance and computational complexity to binarizing all datasets in such a way to ensure that d H = d H.

Empirical results. We did an extensive comparative analysis which is visualized through various plots (the full set is available in [Schwind et al., 2025]). Box plots were generated for each metric, showing model performance across the 58 datasets, scatter plots compared the proposed model with each baseline, and spider plots have been used to compare the performance of all models for individual datasets and in an aggregate view. We present some of them hereafter. The spider plots in Figure 2 show in a synthetic way the average predictive performances (over the 58 datasets) of the ten ML models learned for each of the ten families of models considered in our experiments and assessed for each of the eight metrics (our learning framework is simply named OCF in the ﬁgures). One can see that outperforms the baseline Maj model (which is mandatory to be considered as a signiﬁcant learning operator). It turns out that in practice achieves performances that are similar to the ones of Naive Bayes. Finally, performs slightly better than the other models when recall is used to measure the predictive performance. This is an important point since high recall is expected in applications where missing a positive case can be costly, such as in healthcare or when dealing with attack detection. Focusing on the balanced accuracy metric, Figure 3 (left) presents a scatter plot for contrasting in a more precise way the predictive performances of and Naive Bayes; and Figure 3 (right) presents boxplots representing the distributions of predictive performance achieved over all datasets by each

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Figure 3: A comparison between our learning model and the Naive Bayes learning model across all datasets (left), and a comparison of performance between all learning models across all datasets (right), both focusing on the balanced accuracy performance metric.

Figure 4: Examples of TOCFS (OCF and threshold) showing how positive and negative test instances are separated, on samplings randomly chosen from four datasets (from left to right: 73-Mushroom, 327-Phishing Websites, 545-Rice, 603-In-Vehicle Coupon Recommendation).

of the ten ML models at hand. The results presented are coherent with those shown in Figure 2 (especially, the averaged balanced accuracy of is slightly greater than the one of Naive Bayes and much better than the one of Maj). Additionally, Figure 4 showcases some learned classifers (D, κ, τ) for some test sets in some randomly chosen dataset samplings. Each ﬁgure depicts the value distribution of all κ(ωX) for each test instance ωX from the sampling, represented as a blue (resp. red) dot when ϕωX |= y (resp. ϕωX |= y), and the green horizontal line corresponds to the threshold τ. It can be observed that the learned model separates quite accurately the positive instances from the negative ones for the ﬁrst three samples, but clearly not for the last one (603-In-Vehicle Coupon Recommendation).

7 Conclusion In this paper, we showed how improvement operators, grounded in Belief Change Theory, can give rise to a family of online learning models, and we have presented learning and inference algorithms designed for this family. We empiri-

cally evaluated the predictive performance of a speciﬁc model within this family using a range of datasets. While the results indicate that this simple model does not match the accuracy of more advanced models (e.g., neural nets), it performs comparably to all other methods in terms of recall. This makes it particularly suitable for applications where failing to detect positives is critical. Additionally, the model outperforms the trivial Majority class model and shows performance on par with Naive Bayes, suggesting a signiﬁcant potential.

This work opens up several avenues for future research. Theoretically, the next step is to analyze the properties of learning operators derived from improvement operators. Empirically, we plan to explore and evaluate additional models within the proposed family. For instance, while the learning algorithm used in the experiments is based on Hamming distance, other distances, such as the Mahalanobis distance, could be explored [Mahalanobis, 1936]. Because it accounts for feature correlations which could be measured on the training set, the Mahalanobis distance appears as a valuable candidate for improving model performance at inference.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Acknowledgements This work was supported in part by the following grants: JSPS KAKENHI Grant Numbers JP25K00375 and JP25K03190; JST CREST Grant Number JPMJCR22D3; the AI Chairs BE4mus IA (ANR-20-CHIA-0028) and EXPEKCTATION (ANR-19-CHIA-0005-01) funded by the French National Research Agency.

References [Alchourrón et al., 1985] Carlos E. Alchourrón, Peter Gärdenfors, and David Makinson. On the logic of theory change: partial meet contraction and revision functions. Journal of Symbolic Logic, 50:510 530, 1985. [Baltag et al., 2011] Alexandru Baltag, Nina Gierasimczuk, and Sonja Smets. Belief revision as a truth-tracking process. In Proceedings of the 13th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 11), pages 187 190, 2011. [Baltag et al., 2019] Alexandru Baltag, Nina Gierasimczuk, and Sonja Smets. Truth-tracking by belief revision. Studia Logica, 107(5):917 947, 2019. [Bloch and Lang, 2002] Isabelle Bloch and Jérôme Lang. Towards Mathematical Morpho-Logics, pages 367 380. Physica-Verlag HD, 2002. [Booth and Meyer, 2006] Richard Booth and Thomas A. Meyer. Admissible and restrained revision. Journal of Artiﬁcial Intelligence Research, 26:127 151, 2006. [Choi et al., 2009] Seung-Seok Choi, Sung-Hyuk Cha, and Charles Tappert. A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1):43 48, 11 2009. [Darwiche and Pearl, 1997] Adnan Darwiche and Judea Pearl. On the logic of iterated belief revision. Artiﬁcial Intelligence, 89(1-2):1 29, 1997. [Gärdenfors, 1988] Peter Gärdenfors. Knowledge in ﬂux. MIT Press, 1988. [Goldsmith et al., 2004] Judy Goldsmith, Robert H. Sloan, Balázs Szörényi, and György Turán. Theory revision with queries: Horn, read-once, and parity formulas. Artiﬁcial Intelligence, 156(2):139 176, 2004. [Jin and Thielscher, 2007] Yi Jin and Michael Thielscher. Iterated belief revision, revised. Artiﬁcial Intelligence, 171(1):1 18, 2007. [Katsuno and Mendelzon, 1991] Hirofumi Katsuno and Alberto O. Mendelzon. Propositional knowledge base revision and minimal change. Artiﬁcial Intelligence, 52:263 294, 1991. [Kelly, 1998] Kevin T. Kelly. The learning power of belief revision. In Proceedings of the 7th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 98), pages 111 124, 1998. [Kelly, 2014] Kevin T. Kelly. A Computational Learning Semantics for Inductive Empirical Knowledge, pages 289 337. Springer International Publishing, 2014.

[Konieczny and Pino Pérez, 2008] Sébastien Konieczny and Ramon Pino Pérez. Improvement operators. In Proceedings of the 11th International Conference on Principles of Knowledge Representation and Reasoning (KR 08), pages 177 187, 2008. [Konieczny et al., 2010] Sébastien Konieczny, Mattia Medina Grespan, and Ramon Pino Pérez. Taxonomy of improvement operators and the problem of minimal change. In Proceedings of the 12th International Conference on Principles of Knowledge Representation and Reasoning (KR 10), pages 161 170, 2010. [Mahalanobis, 1936] Prasanta Chandra Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India, 2:49 55, 1936. [Medina Grespan and Pino Pérez, 2013] Mattia Medina Grespan and Ramón Pino Pérez. Representation of basic improvement operators. In Trends in Belief Revision and Argumentation Dynamics, pages 195 227. College Publications, 2013. [Pagnucco and Rajaratnam, 2005] Maurice Pagnucco and David Rajaratnam. Inverse resolution as belief change. In Proceedings of the 19th International Joint Conference on Artiﬁcial Intelligence (IJCAI 05), pages 540 545, 2005. [Pedregosa et al., 2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Mathieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. [Sauerwald and Beierle, 2019] Kai Sauerwald and Christoph Beierle. Decrement operators in belief change. In Proceedings of the 15th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU 19), pages 251 262, 2019. [Schwind and Konieczny, 2020] Nicolas Schwind and Sébastien Konieczny. Non-prioritized iterated revision: Improvement via incremental belief merging. In Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning (KR 20), pages 738 747, 2020. [Schwind et al., 2022] Nicolas Schwind, Sébastien Konieczny, and Ramón Pino Pérez. On the representation of Darwiche and Pearl s epistemic states for iterated belief revision. In Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning (KR 22), 2022. [Schwind et al., 2023] Nicolas Schwind, Katsumi Inoue, and Pierre Marquis. Editing boolean classiﬁers: A belief change perspective. In Proceedings of the 37th AAAI Conference on Artiﬁcial Intelligence (AAAI 23), pages 6516 6524, 2023. [Schwind et al., 2025] Nicolas Schwind, Katsumi Inoue, Sébastien Konieczny, and Pierre Marquis. Iterated belief

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

change as learning: Supplementary material. https: //github.com/nicolas-schwind/Iterated Belief Change ML, 2025. [Shalev-Shwartz and Ben-David, 2014] Shai Shalev Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. [Spohn, 1988] Wolfgang Spohn. Ordinal Conditional Functions: A Dynamic Theory of Epistemic States, pages 105 134. Springer Netherlands, 1988. [Williams, 1995] Mary-Anne Williams. Iterated theory base change: A computational model. In Proceedings of the 14th International Joint Conference on Artiﬁcial Intelligence (IJCAI 95), pages 1541 1549, 1995. [Wrobel, 1994] Stefan Wrobel. Concept formation and knowledge revision. Kluwer Academic, Netherlands, 1994. [Ye et al., 2016] Han-Jia Ye, De-Chuan Zhan, and Yuan Jiang. Instance speciﬁc metric subspace learning: A bayesian approach. In Proceedings of the 30th AAAI Conference on Artiﬁcial Intelligence (AAAI 16), pages 2272 2278, 2016.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)