Targeting Minimal Rare Itemsets from Transaction Databases

Amel Hidouri¹, Badran Raddaoui²,³, Saïd Jabbour¹
¹ CRIL & CNRS, Université d'Artois, Lens, France
² SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France
³ Institute for Philosophy II, Ruhr University Bochum, Germany
{hidouri, jabbour}@cril.fr, badran.raddaoui@telecom-sudparis.eu

Abstract

The computation of minimal rare itemsets is a well-known task in data mining, with numerous applications, e.g., drug effects analysis and network security, among others. This paper presents a novel approach to the computation of minimal rare itemsets. First, we introduce a generalization of the traditional minimal rare itemset model called k-minimal rare itemset. A k-minimal rare itemset is defined as an itemset that becomes frequent when at least k items are removed from it, and remains rare when at most (k − 1) items are removed. We claim that our work is the first to propose this generalization in the field of data mining. We then present a SAT-based framework for efficiently discovering k-minimal rare itemsets from large transaction databases. Afterwards, by partitioning the k-minimal rare itemset mining problem into smaller sub-problems, we aim to make it more manageable and easier to solve. Finally, to evaluate the effectiveness and efficiency of our approach, we conduct an extensive experimental analysis using various popular datasets. We compare our method with existing specialized algorithms and CP-based algorithms commonly used for this task.

1 Introduction

Pattern extraction is a core topic in data mining. It aims to automatically infer knowledge from data based on various kinds of interestingness measures. This involves the identification of relevant patterns embedded in data. These patterns represent the properties of the data satisfying users' preferences. Frequent Itemset Mining (FIM, for short) is a typical task in the data analysis area that aims to find objects that frequently occur together in data. The original motivation for FIM was market basket analysis, which involves exploring retail transactions to discover frequently co-occurring products. Earlier work has often focused on mining frequent patterns (see [Fournier-Viger et al., 2017] for a detailed survey). However, in some cases certain items have a much lower chance of appearing frequently than others. Hence, patterns appearing rarely, i.e., itemsets having a low frequency in the database, are overlooked since they are considered uninteresting and are eliminated by the frequency constraint. In contrast, it has been shown recently that in many real-life applications, targeting infrequent itemsets, called rare itemsets, may be more informative than discovering frequently occurring patterns [Darrab et al., 2021]. For instance, in the medical field, infrequent patterns may be more appealing because their discovery would aid in the avoidance of negative consequences and the formulation of healthcare decisions. During the Covid-19 pandemic, one might be interested, for example, in analyzing clinical data to discover unusual combinations of disease symptoms and risk factors that have not previously been identified. These patterns could be used as surrogate indicators to identify patients with Covid-19 and predict their clinical outcomes, so that appropriate treatments can be provided.
Another example of using rare itemsets arises in the banking field, where finding irregular behaviors could be helpful for identifying potentially fraudulent transactions.

The problem of mining rare itemsets from transaction databases is quite challenging since they are in practice much more numerous than frequent ones. To overcome this issue, a condensed representation of rare itemsets, coined minimal rare itemsets, has been considered. Minimal rare itemsets are of interest since they represent a border below which all subsets are frequent itemsets [Mannila and Toivonen, 1997]. Some algorithms have been proposed for computing minimal rare itemsets in transaction databases (e.g., [Szathmary et al., 2007; Szathmary et al., 2012; Belaid et al., 2019]).

In this paper, we are primarily interested in generalizing the concept of minimal rare itemset, and then computing these motifs in large transaction databases. In fact, the frequency of some itemsets cannot exceed the minimum frequency threshold due to the presence of some items. As a result, these itemsets are regarded as rare. In this case, it may be more interesting to focus on finding this kind of itemsets while relaxing the frequency constraint over a fixed number of items. This representation will be referred to as k-minimal rare itemsets. In more detail, a k-minimal rare itemset X is an itemset that becomes frequent when removing at least k items from it, and remains rare when removing at most (k − 1) items. Clearly, a 1-minimal rare itemset is a minimal rare itemset. Interestingly, such a generalization could aid in the discovery of what may be relevant in data in real applications. In fact, k-minimal rare itemsets can be used to identify errors in certain processes. These itemsets, for example, aid in the discovery of the causes of abnormal effects of drugs taken by patients in the medical field. Moreover, such patterns can assist a retailer in identifying irrelevant purchased products, allowing her/him to provide the appropriate product combinations. A retail store manager can use this knowledge to assess the risk of selling a product by removing it from the market.

Our Contributions. In this paper, we first present a generalization of the minimal rare itemset model, which we call k-minimal rare itemset. Note that our framework is general enough to encompass minimal rare itemsets as a specific case. Then, we provide a symbolic Artificial Intelligence (AI) approach to discovering k-minimal rare itemsets from transaction databases. In detail, we present a SAT-based framework for translating the task of finding k-minimal rare itemsets into a propositional satisfiability problem that can be solved using a SAT solver. To tackle the scalability issue, we afterwards provide a splitting-based approach that partitions the database into smaller subsets that can be mined independently without jeopardizing completeness. Finally, we conduct extensive experiments on different popular real-world datasets to evaluate the efficiency of our approach w.r.t. the state-of-the-art methods.

2 General Setting

2.1 Propositional Satisfiability

Let L be a propositional language built up inductively from a countable set P of propositional variables, the Boolean constants ⊤ (true or 1) and ⊥ (false or 0), and the classical logical connectives {¬, ∧, ∨, →, ↔} in the usual way. To range over the elements of P, we use the letters x, y, z, and so on.
A literal is a propositional variable (x) of P or its negation (¬x). A clause is a (finite) disjunction of literals. Propositional formulas of L are denoted by Φ, Ψ, etc. For any formula Φ from L, P(Φ) denotes the symbols of P occurring in Φ. A formula in conjunctive normal form (or simply CNF) is a finite conjunction of clauses. A Boolean interpretation I of a CNF formula Φ is defined as a function from P(Φ) to {0, 1}. A model of Φ is an interpretation I : P(Φ) → {0, 1} that satisfies all clauses of Φ (denoted I |= Φ). The formula Φ is satisfiable if it has at least one model. We write M(Φ) as shorthand for the set of models of a formula Φ.

The propositional satisfiability problem (in short, SAT) is the decision problem for propositional logic. Given a CNF formula Φ, SAT determines whether there exists a model of Φ. SAT solvers have been used in a variety of real-world scenarios, e.g., electronic design automation, software and hardware verification [Morgado and Marques-Silva, 2005], etc. Note that the aforementioned applications require the enumeration of all possible models of the corresponding CNF formula. This task, known as model enumeration, is of great importance. For instance, in diagnosis, the user is interested in finding all possible explanations rather than just whether one exists. Model enumeration problems find applications in a variety of tasks, including network verification [Zhang et al., 2012], model checking [Hu and Martin, 2004], as well as several data mining tasks, e.g., [Dlala et al., 2018; Jabbour et al., 2018; Boudane et al., 2016; Izza et al., 2020; Hidouri et al., 2021b; Hidouri et al., 2021a; Hidouri et al., 2022], and graph analysis [Jabbour et al., 2016; Jabbour et al., 2017; Jabbour et al., 2022]. In the last decade, several SAT-based proposals for modeling and solving pattern mining problems have indeed been studied. Because they rely on generic and declarative solvers, these approaches are suitable for modeling a variety of mining tasks.

2.2 Overview of Rare Itemsets

Let Ω be a universe of items (or symbols) that may represent articles in a supermarket, web pages, or a collection of attributes or events. We denote the elements of Ω by the letters a, b, c, etc. An itemset (or simply a motif) is a subset of items of Ω, that is, X ⊆ Ω. The set of all itemsets over Ω is denoted 2^Ω; itemsets are represented by the capital letters X, Y, Z, etc. A transaction database is a finite nonempty set of transactions or records D = {T1, T2, . . . , Tm}. We denote by Da the sub-base containing the set of transactions where a appears. Given a transaction database D, the cover is a mapping from 2^Ω to 2^D which maps each itemset X to the set of transactions of D containing X. More formally, Cover(X, D) = {i ∈ [1..m] | Ti ∈ D and X ⊆ Ti}. The cardinality of the cover of the itemset X represents its support (also called frequency). We write Supp(X, D) for the support of X in the database D, i.e., Supp(X, D) = |Cover(X, D)|. When there is no confusion, the cover and support will simply be denoted Cover(X) and Supp(X), respectively. Given a minimum support threshold λ, an itemset X is rare in D when Supp(X, D) < λ, and frequent otherwise; a k-minimal rare itemset (k-MRI) is a rare itemset that becomes frequent as soon as any k of its items are removed, while remaining rare when at most (k − 1) items are removed.

Proposition 1 Let k > 1. If X is a k-MRI in D, then for all x ∈ X, X \ {x} is a (k − 1)-MRI in D.

Proposition 2 Given a transaction database D and a minimum support threshold λ, if X is a 1-MRI in D with X = Y ∪ Z, then Z is a 1-MRI in {Ti ∈ D | Y ⊆ Ti}.
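To make the notions of cover, support, and k-minimal rarity concrete, here is a small, purely illustrative Python sketch that checks the definitions by brute force on a toy database. The function names, the set-based database layout, and the toy data are our own illustration, not part of the paper's framework, and the check is exponential in the size of the candidate itemset.

```python
from itertools import combinations

def cover(itemset, database):
    """Indices of transactions containing every item of `itemset`."""
    return {i for i, t in enumerate(database) if set(itemset) <= t}

def supp(itemset, database):
    """Support of `itemset`: the size of its cover."""
    return len(cover(itemset, database))

def is_k_mri(itemset, database, lam, k):
    """Brute-force check of the k-minimal rare itemset definition:
    removing any (k - 1) or fewer items keeps the itemset rare
    (support < lam), while removing any k items makes it frequent
    (support >= lam)."""
    items = set(itemset)
    if k > len(items):
        return False
    for j in range(k):                      # remove 0, 1, ..., k - 1 items
        for removed in combinations(items, j):
            if supp(items - set(removed), database) >= lam:
                return False                # should have stayed rare
    for removed in combinations(items, k):  # remove exactly k items
        if supp(items - set(removed), database) < lam:
            return False                    # should have become frequent
    return True

if __name__ == "__main__":
    # Toy database: each transaction is a set of items.
    D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "e"}, {"b", "c"}, {"a", "b", "c", "d"}]
    print(is_k_mri({"d"}, D, lam=2, k=1))       # True: {d} is rare, its only proper subset {} is frequent
    print(is_k_mri({"d", "e"}, D, lam=2, k=2))  # True: {d} and {e} stay rare; removing both gives {}
```

The SAT encoding presented next replaces this explicit enumeration with a propositional formula whose models are exactly the k-MRIs.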
3 SAT-based Encoding for k-MRIs Mining

This section presents our SAT-based encoding for the k-minimal rare itemset enumeration problem. Given a transaction database D, our key idea is to encode the task of k-MRIs computation into a propositional formula whose models correspond exactly to the set of k-minimal rare itemsets of D. In this view, the problem of mining such motifs is translated into the one of computing all models of a propositional formula. We start by providing a SAT encoding for the enumeration of 1-MRIs before proposing a generalization for the k-MRIs mining problem.

For that purpose, to establish a one-to-one mapping between the models of our SAT-based encoding and the set of 1-MRIs in the database D, each item a ∈ Ω is associated with a propositional variable x_a, where x_a is true iff a belongs to the candidate rare itemset, while each transaction Ti in D is associated with two propositional variables p_i and q_i, where p_i (resp. q_i) is true iff the transaction Ti contains the candidate rare itemset (resp. the candidate rare itemset except one item). Now, in order to establish a mapping between the set of 1-MRIs in the database D and the models of the corresponding propositional formula, we introduce the following logical constraints.

Cover constraint. The first constraint allows to capture the cover of the itemset and it can be expressed as follows:

⋀_{Ti ∈ D} ( p_i ↔ ⋀_{a ∈ Ω \ Ti} ¬x_a )    (1)

Constraint (1) means that the candidate itemset appears in the transaction Ti iff all items that do not occur in that transaction are set to false.

Infrequency constraint. The second constraint ensures that the candidate itemset is not frequent in D:

Σ_{Ti ∈ D} p_i < λ    (2)

Given a minimum support threshold λ, Constraints (1) and (2) allow to find the set of all rare itemsets in D. Now, in order to find a 1-MRI, one needs to identify the set of transactions matching X in D as well as those that match X \ {a} for all a ∈ X. This can be expressed by the following constraint.

Generator constraint. This constraint allows to catch the set of transactions in the database D involving the whole itemset except one item:

⋀_{Ti ∈ D} ( q_i ↔ ( Σ_{a ∈ Ω \ Ti} x_a = 1 ) )    (3)

In other words, enforcing a rare itemset X to be minimal is equivalent to enforcing the itemset X \ {a} (∀a ∈ X) to be frequent. This requirement can be modeled with the next constraint.

Frequency constraint. This constraint requires that each itemset X \ {a} (∀a ∈ X) must be frequent in D:

⋀_{a ∈ Ω} ( x_a → ( Σ_{Ti ∈ D} p_i + Σ_{Ti ∈ D, a ∉ Ti} q_i ≥ λ ) )    (4)

Example 3 Consider again Table 1 and let λ = 3. Then, the cover, infrequency, generator and frequency constraints can be written as given in Figure 2.

Property 1 Each model of the formula Φ^λ_D = (1) ∧ (2) ∧ (3) ∧ (4) corresponds to a 1-MRI in the database D.

The next result shows that the formula Φ^λ_D entails the following set of clauses.

Proposition 3 For a given transaction database D, we have Φ^λ_D |= Θ, where

Θ = ⋀_{a ∈ Ω} ( x_a → ⋁_{Ti ∈ D, a ∉ Ti} q_i )    (5)

It is worth noting that Constraint (5) can be useful for unit propagation. In fact, this constraint enforces the rare itemset to be a generator. Indeed, it expresses that there exists a transaction in which X \ {a} appears for each a ∈ X. As this property holds for every a ∈ X, X is a generator.
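As a rough illustration of how such an encoding can be built and solved in practice, the sketch below constructs Constraints (1) and (2) with the PySAT library and enumerates the rare itemsets of a toy database by blocking models projected onto the item variables. This is a minimal sketch under our own assumptions (variable pool layout, sequential-counter cardinality encoding, Glucose solver); the minimality constraints (3) and (4) would need reified cardinality constraints and are omitted here, so the enumerated models are rare itemsets, not yet 1-MRIs.

```python
from pysat.formula import IDPool
from pysat.card import CardEnc, EncType
from pysat.solvers import Solver

def rare_itemsets(database, universe, lam):
    """Enumerate itemsets with support < lam via Constraints (1)-(2)."""
    pool = IDPool()
    x = {a: pool.id(("x", a)) for a in universe}               # item variables
    p = {i: pool.id(("p", i)) for i in range(len(database))}   # cover variables
    clauses = []
    # Constraint (1): p_i <-> AND_{a not in T_i} -x_a
    for i, t in enumerate(database):
        missing = [a for a in universe if a not in t]
        for a in missing:
            clauses.append([-p[i], -x[a]])                     # p_i -> -x_a
        clauses.append([p[i]] + [x[a] for a in missing])       # (AND -x_a) -> p_i
    # Constraint (2): sum of p_i < lam, i.e. at most (lam - 1) covering transactions
    card = CardEnc.atmost(lits=list(p.values()), bound=lam - 1,
                          vpool=pool, encoding=EncType.seqcounter)
    clauses.extend(card.clauses)
    # Enumerate models, blocking only the projection on the item variables.
    results = []
    with Solver(name="g3", bootstrap_with=clauses) as solver:
        while solver.solve():
            model = set(solver.get_model())
            itemset = {a for a in universe if x[a] in model}
            results.append(itemset)
            solver.add_clause([-x[a] if a in itemset else x[a] for a in universe])
    return results

if __name__ == "__main__":
    D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
    print(rare_itemsets(D, universe=["a", "b", "c", "d"], lam=2))
```

Adding reified cardinality constraints for (3) and (4) on top of this skeleton would restrict the enumerated models to 1-MRIs, in line with Property 1.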
Interestingly, the previous encoding can be generalized to deal with the problem of mining the set of all k-MRIs from a given transaction database. To do so, let us present the following three logical constraints.

⋀_{0 ≤ j ≤ k} ⋀_{Ti ∈ D} ( p_{i,j} ↔ ( Σ_{a ∈ Ω \ Ti} x_a = j ) )    (6)

⋀_{X ⊆ Ω, |X| = k−1} ( ( ⋀_{a ∈ X} x_a ) → ( Σ_{0 ≤ j ≤ k−1} Σ_{Ti ∈ D, |(Ω \ Ti) ∩ X| = j} p_{i,j} < λ ) )    (7)

⋀_{X ⊆ Ω, |X| = k} ( ( ⋀_{a ∈ X} x_a ) → ( Σ_{0 ≤ j ≤ k} Σ_{Ti ∈ D, |(Ω \ Ti) ∩ X| = j} p_{i,j} ≥ λ ) )    (8)

Let us now go over these constraints in greater depth. To compute the set of all k-MRIs in D, we first need to identify the transactions involving the whole itemset (i.e., p_{i,0}), those containing the itemset except one item (i.e., p_{i,1}), up to those with k missing items (i.e., p_{i,k}). Constraint (6) enables the identification of such a set of transactions. Next, Constraint (7) requires that the candidate itemset remains rare after the deletion of any (k − 1) items. This requires an additional constraint for each subset of items from Ω of size (k − 1). Note that in this case the number of constraints can be exponential w.r.t. k. Lastly, to allow the removal of k items, we need to identify the transactions containing the itemset partially (i.e., the candidate itemset without k items). This condition can be modelled with Constraint (8).

Notice that for k = 1, the SAT encoding of 1-MRIs corresponds to Φ^λ_D by substituting p_{i,0} (resp. p_{i,1}) by p_i (resp. q_i). Also, it is worth noting that Constraint (6) relies on cardinality constraints of the form Σ_{i=1}^{n} x_i = j. Such an equality constraint can be decomposed into an at-most and an at-least cardinality constraint, each of which admits a standard CNF translation. Let D be a transaction database, λ a minimum support threshold, and k > 0. Then the propositional formula Φ^{k,λ}_D = (6) ∧ (7) ∧ (8) encodes the problem of mining k-MRIs in D.

4 Towards a More Efficient Computation of Rare Itemsets

The practical SAT-based encoding of the k-MRIs mining problem, defined in Section 3, is polynomial in k, |Ω|, and |D|. In particular, the number of propositional variables and clauses is bounded by O(|Ω| · |D|² + |Ω|² · |D|) for 1-MRIs. Unfortunately, even if such complexity is polynomially bounded, it becomes challenging and time-consuming for large datasets. This is due especially to the cardinality constraint (4) (for 1-MRIs) that involves all the transactions of D. Hence, the size of the whole encoding may be huge. To obviate this scalability issue, we utilize a decomposition scheme for better efficiency by avoiding the generation of large CNF formulas. We provide next an illustration through the 1-MRIs case, which can be generalized to k-MRIs (k > 1).

In more detail, the aim is to partition the initial mining problem into a set of less complex and more manageable sub-problems. The resolution task is then expected to be easier than that for the original problem. In other words, the search space is divided into disjoint parts by the use of Shannon's principle [Zaki, 1999] (also referred to as guiding path [Zhang et al., 1996]). In fact, the guiding path is a set of formulas added to the original formula in order to split the search space. More formally, let Φ be a propositional formula and {Γ1, . . . , Γn} be a set of sub-formulas over P(Φ) such that Γ1 ∨ . . . ∨ Γn is valid and Γi ∧ Γj |= ⊥ for all i ≠ j. Then, the satisfiability of the formula Φ is related to the satisfiability of at least one formula Φ ∧ Γi, 1 ≤ i ≤ n, and we have that M(Φ) = ⋃_{i=1}^{n} M(Φ ∧ Γi). In [Jabbour et al., 2020], the authors defined Γi as Γi = x_{a_i} ∧ ⋀_{1 ≤ j < i} ¬x_{a_j}.
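To illustrate how such a guiding path splits model enumeration into independent jobs, here is a small sketch assuming the PySAT solver interface and an ordering a1, . . . , an of the items: partition i enumerates the models in which x_{a_i} is true and x_{a_1}, . . . , x_{a_{i−1}} are false, passed to the solver as assumptions. The helper name and the sequential loop (which could just as well dispatch the partitions to parallel workers) are our own illustration; note also that the assignment where all item variables are false (the empty candidate) is not covered by any partition.

```python
from pysat.solvers import Solver

def enumerate_by_guiding_path(clauses, item_vars):
    """Enumerate models of `clauses`, split into |item_vars| independent
    sub-problems: partition i assumes x_{a_i} = 1 and x_{a_j} = 0 for j < i."""
    all_models = []
    for i, xi in enumerate(item_vars):
        # Gamma_i = x_{a_i} AND (AND_{j < i} -x_{a_j}), given as assumptions.
        assumptions = [xi] + [-xj for xj in item_vars[:i]]
        with Solver(name="g3", bootstrap_with=clauses) as solver:
            for model in solver.enum_models(assumptions=assumptions):
                all_models.append(model)
        # Each partition could be handed to a separate solver or worker.
    return all_models

if __name__ == "__main__":
    # Tiny example: models of (x1 or x2) and (-x1 or x3), split on items x1, x2.
    clauses = [[1, 2], [-1, 3]]
    print(enumerate_by_guiding_path(clauses, item_vars=[1, 2]))
```

Because the Γi are mutually inconsistent, each model is found exactly once; only assignments in which no item variable is true fall outside all partitions, as noted above.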
Furthermore, we plan to investigate more advanced and powerful partitioning techniques to improve the performance of our approach.

Note that we did not count the clauses of formulas solved by unit propagation; such cases are marked in Table 2.

Acknowledgments

This research has received support from the European Union's Horizon research and innovation programme under the MSCA-SE (Marie Skłodowska-Curie Actions Staff Exchange) grant agreement 101086252; Call: HORIZON-MSCA-2021-SE-01; Project title: STARWARS (STormwAteR and WastewAteR networkS heterogeneous data AI-driven management).

References

[Belaid et al., 2019] Mohamed-Bachir Belaid, Christian Bessiere, and Nadjib Lazaar. Constraint programming for mining borders of frequent itemsets. In IJCAI, pages 1064–1070, 2019.

[Boudane et al., 2016] Abdelhamid Boudane, Saïd Jabbour, Lakhdar Sais, and Yakoub Salhi. A SAT-based approach for mining association rules. In IJCAI, pages 2472–2478, 2016.

[Boudane et al., 2018] Abdelhamid Boudane, Saïd Jabbour, Lakhdar Sais, and Yakoub Salhi. SAT-based data mining. Int. J. Artif. Intell. Tools, 27(1):1–24, 2018.

[Darrab et al., 2021] Sadeq Darrab, David Broneske, and Gunter Saake. Modern applications and challenges for rare itemset mining. Int. J. Mach. Learn. Comput., 11(3):208–218, 2021.

[Dlala et al., 2018] Imen Ouled Dlala, Saïd Jabbour, Badran Raddaoui, and Lakhdar Sais. A parallel SAT-based framework for closed frequent itemsets mining. In CP, pages 570–587, 2018.

[Fournier-Viger et al., 2017] Philippe Fournier-Viger, Jerry Chun-Wei Lin, Bay Vo, Tin Truong Chi, Ji Zhang, and Hoai Bac Le. A survey of itemset mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(4):e1207, 2017.

[Hidouri et al., 2021a] Amel Hidouri, Saïd Jabbour, Imen Ouled Dlala, and Badran Raddaoui. On minimal and maximal high utility itemsets mining using propositional satisfiability. In IEEE International Conference on Big Data (Big Data), pages 622–628, 2021.

[Hidouri et al., 2021b] Amel Hidouri, Saïd Jabbour, Badran Raddaoui, and Boutheina Ben Yaghlane. Mining closed high utility itemsets based on propositional satisfiability. Data Knowl. Eng., 136:101927, 2021.

[Hidouri et al., 2022] Amel Hidouri, Saïd Jabbour, and Badran Raddaoui. On the enumeration of frequent high utility itemsets: A symbolic AI approach. In CP, pages 27:1–27:17, 2022.

[Hu and Martin, 2004] Alan J. Hu and Andrew K. Martin. Formal Methods in Computer-Aided Design: 5th International Conference, FMCAD 2004, Austin, Texas, USA, November 15-17, 2004, Proceedings. Springer Science & Business Media, 2004.

[Izza et al., 2020] Yacine Izza, Saïd Jabbour, Badran Raddaoui, and Abdelhamid Boudane. On the enumeration of association rules: A decomposition-based approach. In IJCAI, pages 1265–1271, 2020.

[Jabbour et al., 2016] Saïd Jabbour, Nizar Mhadhbi, Abdesattar Mhadhbi, Badran Raddaoui, and Lakhdar Sais. Summarizing big graphs by means of pseudo-boolean constraints. In IEEE International Conference on Big Data, pages 889–894, 2016.

[Jabbour et al., 2017] Saïd Jabbour, Nizar Mhadhbi, Badran Raddaoui, and Lakhdar Sais. A SAT-based framework for overlapping community detection in networks. In PAKDD, pages 786–798, 2017.

[Jabbour et al., 2018] Saïd Jabbour, Fatima Ezzahra Mana, Imen Ouled Dlala, Badran Raddaoui, and Lakhdar Sais. On maximal frequent itemsets mining with constraints. In CP, pages 554–569, 2018.

[Jabbour et al., 2020] Saïd Jabbour, Nizar Mhadhbi, Badran Raddaoui, and Lakhdar Sais. SAT-based models for overlapping community detection in networks. Computing, 102(5):1275–1299, 2020.

[Jabbour et al., 2022] Saïd Jabbour, Nizar Mhadhbi, Badran Raddaoui, and Lakhdar Sais. A declarative framework for maximal k-plex enumeration problems. In AAMAS, pages 660–668, 2022.

[Mannila and Toivonen, 1997] Heikki Mannila and Hannu Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.

[Morgado and Marques-Silva, 2005] António Morgado and João Marques-Silva. Algorithms for propositional model enumeration and counting.
Technical report, Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa, 2005.

[Szathmary et al., 2007] Laszlo Szathmary, Amedeo Napoli, and Petko Valtchev. Towards rare itemset mining. In ICTAI, pages 305–312, 2007.

[Szathmary et al., 2012] Laszlo Szathmary, Petko Valtchev, Amedeo Napoli, and Robert Godin. Efficient vertical mining of minimal rare itemsets. In CLA, 2012.

[Zaki, 1999] Mohammed J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, pages 14–25, 1999.

[Zhang et al., 1996] Hantao Zhang, Maria Paola Bonacina, and Jieh Hsiang. PSATO: a distributed propositional prover and its application to quasigroup problems. Journal of Symbolic Computation, pages 543–560, 1996.

[Zhang et al., 2012] Shuyuan Zhang, Sharad Malik, and Rick McGeer. Verification of computer switching networks: An overview. In International Symposium on Automated Technology for Verification and Analysis, pages 1–16, 2012.